We study the problem of consistent query answering under primary key violations. In
this setting, the relations in a database violate the key constraints and we are interested
in maximal subsets of the database that satisfy the constraints, which we call repairs.
For a boolean query Q, the problem Certainty(Q) asks whether every such repair
satisfies the query or not; the problem is known to be always in coNP for conjunctive
queries. However, there are queries for which it can be solved in polynomial time. It has
been conjectured that there exists a dichotomy for the complexity of Certainty(Q) for
conjunctive queries: it is either in PTIME or coNP-complete. In this paper, we prove
that the conjecture is indeed true for the case of conjunctive queries without self-joins,
where each atom has as a key either a single attribute (simple key) or all attributes of
the atom (no key violations).

1. Introduction

Uncertainty in databases arises in several applications and domains (e.g. data
integration, data exchange). An uncertain (or inconsistent) database is one that violates the
integrity constraints of the database schema. In this work, we will examine uncertainty
under the well-studied framework of consistent query answering, established in [1].

In this framework, the presence of uncertainty generates many possible worlds, usually
referred to as repairs. For an inconsistent database I, a repair is a subset of I that
minimally differs from I and also satisfies the integrity constraints. For a given query
Q on database I, the set of certain answers contains all the answers that occur in every
Q(r), where r is a repair of I. The main research problem here is when the certain
answers can be computed efficiently.

In this paper, we will restrict the problem such that the integrity constraints are only
key constraints, and moreover, the queries are boolean conjunctive queries. In this case,
a repair r of an inconsistent database I selects from each relation a maximal number of
tuples such that no two tuples are key-equal. We further say that a boolean conjunctive
query Q is certain if it evaluates to true for every such repair r. In this case, we define
the decision problem Certainty(Q) as follows: given an inconsistent database I, does
Q(r) evaluate to true for every repair r of I?
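The definition of Certainty(Q) can be made concrete with a brute-force check that
enumerates every repair. The following is only a minimal sketch under stated assumptions:
the dictionary encoding of an instance, the helper names `repairs`, `q1` and `certain`,
the convention that the key is the first attribute, and the example instance are all
illustrative choices and do not come from the paper. The enumeration is exponential in the
number of key-groups, so it serves as a specification rather than as an algorithm.

```python
from itertools import product

def repairs(instance):
    """Yield every repair of `instance` (relation name -> set of tuples).

    Assumption: the primary key of every relation is its first attribute.
    """
    groups = {}
    for name, tuples in instance.items():
        for t in tuples:
            groups.setdefault((name, t[0]), []).append(t)  # key-group R(a, -)
    keys = list(groups)
    # A repair keeps exactly one tuple from every key-group.
    for choice in product(*(groups[k] for k in keys)):
        repair = {name: set() for name in instance}
        for (name, _), t in zip(keys, choice):
            repair[name].add(t)
        yield repair

def q1(r):
    """Boolean query R(x, y), S(y, z): some R-tuple joins with some S-tuple."""
    return any(y == y2 for (_, y) in r["R"] for (y2, _) in r["S"])

def certain(query, instance):
    """Brute-force Certainty(Q): true iff the query holds in every repair."""
    return all(query(r) for r in repairs(instance))

# Hypothetical instance: the key-group R(a, -) has two key-equal tuples.
instance = {"R": {("a", "b1"), ("a", "b2")}, "S": {("b1", "c")}}
print(certain(q1, instance))  # False: the repair that keeps R(a, b2) falsifies the query
```

Adding the tuple S(b2, c) to this instance makes the query certain, since both choices for
the key-group R(a, −) then join with S.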

For this setting, it is known that Certainty(Q) is always in coNP [2]. However,
depending on the key constraints and the structure of the query Q, the complexity of the
problem may vary. For example, for the query Q_1 = R(x, y), S(y, z), Certainty(Q_1)
is not only in P but, since one can show that Certainty(Q_1) can be expressed as a
first-order query over I [5], it is in AC^0. On the other hand, for Q_2 = R(x, y), S(z, y),
it has been proved in [5] that Certainty(Q_2) is coNP-complete. Finally, for Q_3 =
R(x, y), S(y, x), one can show [12] that consistent query answering is in P, but the problem
does not admit a first-order rewriting.

From the three examples above, one can see that the complexity landscape is fairly
intricate, even for the class of conjunctive queries. Although there has been major progress
in understanding the complexity for several classes of queries, the problem of deciding the
complexity of Certainty(Q) remains open. In fact, a long-standing conjecture claims
the following dichotomy.

Conjecture 1.1. For every boolean conjunctive query Q, Certainty(Q) is either in P
or is coNP-complete.

The progress that has been made towards proving this conjecture has been limited
and is mostly focused on simple queries. In particular, the authors in [7] have recently
proved a dichotomy into P and coNP-complete for the case where Q contains only two
atoms and has no self-joins. Wijsen [11] has given a necessary and sufficient condition
for first-order rewriting for acyclic conjunctive queries without self-joins; however,
non-first-order expressibility does not imply anything for the coNP-hardness of the query.

In this work, we significantly progress the status of the conjecture, by settling the
dichotomy for a large class of queries: boolean conjunctive queries w/o self-joins, where
each atom has as primary key either a single attribute or all the attributes. Observe
that this class contains all queries where atoms have arity at most 2; in particular, it
also contains all three of the queries Q_1, Q_2, Q_3 previously discussed. Our results apply
to a more general setting where one might have the knowledge that some relations are
consistent and others may potentially be inconsistent. Such an assumption may drop the
complexity of the problem from coNP-complete to a lower complexity class.¹ Our main
result is stated as follows.

Theorem 1.2 (Dichotomy). For every boolean conjunctive query Q w/o self-joins and
atoms where the primary key is either a single attribute or all attributes of the atom,
there exists a dichotomy of Certainty(Q) into P and coNP-complete.

Our techniques are based on analyzing a specific graph representation of the query,
along with the key constraints. More precisely, we show in Section 3 how to transform the
initial query to one that contains only atoms with two attributes, and further, in every
relation the key consists of exactly one variable. For example, the queries Q_1, Q_2, Q_3 are
already in this form. The query graph, which we denote by G[Q], has variables as nodes,
and a directed edge (x, y) exists whenever a relation R(x, y) exists.

¹ Consider the coNP-hard query Q_2 = R(x, y), S(z, y) and assume that we are given the fact that
S is a consistent relation. Then, the problem becomes easy to solve. In fact, the following condition
is necessary and sufficient for whether Q_2 is certain on an instance I: there exists a key-group R(a, −)
such that for every tuple R(a, b) ∈ I, there exists some c such that S(c, b) ∈ I.
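The footnote's condition for the query R(x, y), S(z, y) with a consistent relation S is
simple enough to check directly. The sketch below reads the condition as requiring, for
every tuple R(a, b) of some key-group, an S-tuple whose second attribute equals b (the
attribute on which the two atoms join); the function name and the pair-based encoding of
tuples are assumptions made for illustration only.

```python
def q2_certain_with_consistent_s(r_tuples, s_tuples):
    """Footnote condition for R(x, y), S(z, y) when S is known to be consistent.

    r_tuples, s_tuples: sets of pairs; the key of R is its first attribute.
    """
    s_values = {b for (_, b) in s_tuples}      # values of y that S can match
    groups = {}
    for a, b in r_tuples:
        groups.setdefault(a, []).append(b)     # key-group R(a, -)
    # Certain iff some key-group joins with S no matter which tuple is kept.
    return any(all(b in s_values for b in bs) for bs in groups.values())

# Hypothetical data: R(a1, -) can be repaired to R(a1, b2), which has no partner
# in S, but the key-group R(a2, -) always joins with S(c2, b3).
R = {("a1", "b1"), ("a1", "b2"), ("a2", "b3")}
S = {("c1", "b1"), ("c2", "b3")}
print(q2_certain_with_consistent_s(R, S))  # True
```

The check runs in time linear in the size of the instance, which illustrates how knowing
that a relation is consistent can lower the complexity of the problem.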

Given the graph G[Q], we give a necessary and sufficient condition for Certainty(Q)
to be computable in polynomial time. Consider two edges e_R = (u_R, v_R), e_S = (u_S, v_S) in
G[Q] that correspond to two inconsistent relations R, S. We say that e_R, e_S are
source-disjoint if u_R, u_S do not belong in the same strongly connected component of G[Q].
Additionally, we say that e_R, e_S are unsplittable if there exists (a) an undirected path
P_R from either endpoint of e_R to either endpoint of e_S such that no node in P_R is
reachable from u_R through a directed path in G − {e_R} and (b) a symmetrical path P_S
where no node is reachable from u_S through a directed path in G − {e_S}. Then:

Theorem 1.3. Certainty(Q) is coNP-complete if G[Q] contains a pair of
source-disjoint and unsplittable inconsistent edges. Otherwise, it is in P.

Example 1.4. Consider the following two queries: 

K_1 = R(x, y), S(z, w), T^c(y, w)

K_2 = R(x, y), S(z, w), T^c(y, w), U^c(x, z)

Observe that the only difference between K_1, K_2 is the consistent relation U^c. Moreover,
the edges e_R, e_S are source-disjoint in both cases, since there is no directed path from z
to x. In G[K_1], the edges e_R, e_S are also unsplittable. Indeed, consider the path P that
consists of the edge e_T and connects y with w. The two nodes y, w of P are not reachable
from either x or z in the graphs G[K_1] − {e_R}, G[K_1] − {e_S} respectively. In contrast,
the same path P is reachable from x in G[K_2]: indeed, we have the path U(x, z), S(z, w).
Since no other path connects e_R, e_S in G[K_2], the pair e_R, e_S is splittable.
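The criterion of Theorem 1.3 can be prototyped on small query graphs. The sketch below is
one possible reading of the definitions and not the authors' procedure: it assumes that
u_R counts as reachable from itself, that the undirected path may use any edge of G, and
that a graph is given as a dictionary from relation name to a triple (u, v, consistent).
All function names are hypothetical.

```python
def reachable(edges, start, skip=None):
    """Nodes reachable from `start` by directed paths, ignoring the edge `skip`."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for name, (a, b, _) in edges.items():
            if name != skip and a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def source_disjoint(edges, r, s):
    """u_r and u_s lie in different strongly connected components."""
    ur, us = edges[r][0], edges[s][0]
    return not (us in reachable(edges, ur) and ur in reachable(edges, us))

def has_avoiding_path(edges, r, s):
    """Undirected path from an endpoint of e_r to an endpoint of e_s that avoids
    every node reachable from u_r in G - {e_r}."""
    blocked = reachable(edges, edges[r][0], skip=r)
    targets = set(edges[s][:2]) - blocked
    seen = set(edges[r][:2]) - blocked
    stack = list(seen)
    while stack:
        u = stack.pop()
        for a, b, _ in edges.values():
            for v, w in ((a, b), (b, a)):      # traverse edges in both directions
                if v == u and w not in seen and w not in blocked:
                    seen.add(w)
                    stack.append(w)
    return bool(seen & targets)

def unsplittable(edges, r, s):
    return has_avoiding_path(edges, r, s) and has_avoiding_path(edges, s, r)

def conp_hard_pair_exists(edges):
    """True iff some pair of inconsistent edges is source-disjoint and unsplittable."""
    inconsistent = [n for n, (_, _, cons) in edges.items() if not cons]
    return any(source_disjoint(edges, r, s) and unsplittable(edges, r, s)
               for i, r in enumerate(inconsistent) for s in inconsistent[i + 1:])

# The two queries of Example 1.4 (True marks a consistent relation).
K1 = {"R": ("x", "y", False), "S": ("z", "w", False), "T": ("y", "w", True)}
K2 = dict(K1, U=("x", "z", True))
print(conp_hard_pair_exists(K1), conp_hard_pair_exists(K2))  # True False
```

Under these assumptions the sketch agrees with Example 1.4, and it also classifies the
three two-atom queries of the introduction as stated there: only R(x, y), S(z, y) contains
a source-disjoint and unsplittable pair.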

In order to show Theorem 1.3, we develop new techniques for efficient computation
of Certainty(Q), as well as techniques for proving hardness. In particular, we start
by analyzing in Section 4 the case where G[Q] is a directed cycle (Q_3 is such a cycle of
length 2). We show that Certainty(Q) is in P for this case; this result generalizes the
result of [12] for the query Q_3. We build upon this result in Section 5 to show that, if
G[Q] is a strongly connected graph (i.e. there is a directed path from any node to any
other node), Certainty(Q) is also in P. For this result, our algorithms depend on a
novel use of or-sets. For the final piece of the puzzle, we describe in Section 6 a recursive
decomposition of the graphs G[Q] that are tractable, which allows us to solve efficiently
the most general case. Our hardness results are presented also in Section 6 where we
show that we can reduce the NP-hard problem MONOTONE-3SAT to any graph G[Q]
that does not satisfy the condition of Theorem 1.3.

In the specific case where G[Q] is a directed acyclic graph (DAG), every query that is
in P is additionally FO (First-Order) expressible. In this case, we can further strengthen
our result to a dichotomy into FO-expressible and coNP-complete.



2. Related Work 



The consistent query answering framework was first proposed by Arenas et al. in [1].
Fuxman and Miller [5] focused on primary key constraints, with the goal of specifying
conjunctive queries where Certainty(Q) is first-order expressible, i.e. can be
represented as a boolean first-order query over the inconsistent database. They presented a
class of acyclic conjunctive queries w/o self-joins, called C_forest, that allows such
first-order rewriting. Further, Fuxman et al. [4] designed and built a system that supported
the query rewriting functionality for consistent query answering.

In a series of papers P21Q2], Wijsen improved on the results for first-order express- 
ibility. The author presented a necessary and sufficient syntactic condition for the first- 
order expressibility for acyclic conjunctive queries without self-joins. In a later paper, 
Wijsen gave a polynomial time algorithm for the query Q_3 = R(x, y), S(y, x), which
is known to be not first-order expressible. Q_3 is the first query that was proven to be
tractable even though it does not admit a first-order rewriting. Finally, Kolaitis and
Pema [7] proved a dichotomy for the complexity of Certainty(Q) into polynomial time
and coNP-complete when the query has only two atoms and no self-joins.

A problem related to consistent query answering is its counting version,
i.e. given a query and an inconsistent database, count the number of repairs that
satisfy the query. Maslowski and Wijsen [9] showed that this problem admits a dichotomy
into P and #P-complete for conjunctive queries without self-joins.

Finally, we should mention that the problem of consistent query answering is closely
related to probabilistic databases, and in particular disjoint-independent probabilistic
databases [3]. In this setting, the set of possible worlds can be described by an
inconsistent database. For example, for a key-group R(a, −), each tuple R(a, b) has some
probability P(a, b) such that Σ_b P(a, b) ≤ 1. A possible world is then constructed by
choosing, independently for each key-group, at most one tuple with the corresponding
probability. The authors in [3] prove a dichotomy for the complexity of computing the
probability of boolean conjunctive queries w/o self-joins into PTIME and #P-complete.
To see the connection with repairs, suppose that, for each key-group, we assign equal
probabilities for each possible tuple such that their sum is 1. Then, a possible world
corresponds to a repair of the database; hence, for a query Q, P(Q) = 1 if and only if
Q is certain, i.e. every repair satisfies Q. Consequently, we can reduce Certainty(Q)
to computing probabilities in disjoint-independent databases: if the complexity of
computing the probability is in P, Certainty(Q) can be also computed in polynomial time.
However, the converse does not hold: indeed, it is #P-complete to compute the probability
for the query Q_3, but Certainty(Q_3) is in P, as we previously stated.

3. Preliminaries 

Let I be an inconsistent database instance with primary key constraints, where all
relations have arity ≤ 2. We consider a generalization of the typical setting, where
we may have the knowledge that some relations are consistent, i.e. do not violate the
key constraints: we will denote such a relation by R^c. Otherwise, we call the relation
inconsistent and we denote it by R^i.

Formally, each relation contains exactly one primary key, which may consist of one or
two attributes. For a unary relation U(x̲),² the only attribute x will be the key: notice
that U must be consistent. For a binary relation R, the key constraints may be of the
form R(x̲, y), where x is the key, or R(x̲, y̲), where both attributes are the key: again,
notice that R in this case will be a consistent relation.

² We will denote the key variables in a relation by underlining them.

We also define a key-group to be all the tuples of some relation R that have the same key.
For example, given a relation R(x̲, y), the tuples of the form R(a, −) define a key-group
with a as the key. In general, we will denote this key-group by R(a, −).

Definition 3.1 (Repair). A database instance r is a repair for the inconsistent database 
I if (a) r satisfies the key constraints and (b) r is a maximal subset of I that satisfies 
property (a). 

The database I may admit several repairs; we are interested in extracting the answers 
for Q that appear in every repair. More formally: 

Definition 3.2 (Consistent Query Answering). Given an instance I, and a conjunctive
query Q, we say that a tuple t is a consistent answer for Q if for every repair r ⊆ I,
t ∈ Q(r). If Q is a boolean conjunctive query, we say that Q is certain for I, denoted
I ⊨ Q, if for every repair r, Q(r) is true.

Let Certainty(Q) denote the decision problem of whether a boolean conjunctive
query Q is certain or not for a given instance I.

From now on, we will assume that any query is a boolean conjunctive query without
self-joins, where atoms can be unary or binary. In the remainder of this section, we will
describe how to simplify the structure of our problem and also provide several useful notations.

3.1. Frugal Repairs 

We define here the notion of a frugal repair. Let Q^f be the corresponding full query
for Q, i.e. Q^f contains all body variables as head variables.

Definition 3.3 (Frugal Repair). A repair r of I is frugal if there exists no repair r' such
that Q^f(r') ⊊ Q^f(r).

It is easy to see that the following claim holds.

Proposition 3.4. I ⊨ Q if and only if every frugal repair of I satisfies Q.

Proof. The one direction is straightforward: if some frugal repair does not satisfy Q, then
Q cannot be certain. For the other direction, assume that every frugal repair satisfies
Q. Consider any repair r. Let r' be the minimal repair such that Q^f(r') ⊆ Q^f(r). Then,
r' is a frugal repair and hence Q(r') is true. But then, Q^f(r') ≠ ∅, which implies that
Q^f(r) ≠ ∅ and thus that Q(r) is true. □

Thus, it suffices to consider only frugal repairs when we are looking at the
problem: this will simplify the structure in several cases. The following lemma will be used
throughout this paper.

Lemma 3.5. Let a be a value that does not appear in any Q^f(r), where r is a frugal repair
of I, and let I^{-a} ⊆ I s.t. every key-group that contains a has been removed. Then, I ⊨ Q
iff I^{-a} ⊨ Q.

3.2. Simplifying the Structure

In this subsection, we will show how to simplify the structure of our problem. In 
particular, we will show how to transform any query that consists of atoms where the 
key is either a single attribute or all attributes to a query with only binary atoms, where 
each atom has as a primary key a single attribute. 

The first straightforward observation is that we can assume that the hypergraph 
for Q is connected; otherwise, we can solve Certainty(Q) for each of the connected 
components and decide that Q is certain if and only if every component is certain. We 
can assume henceforth w.l.o.g. that Q consists of a single connected component. 

The next step is to handle the unary atoms in Q. Let Q^u be the query derived from
Q by removing all the occurrences of unary atoms. Then, the following holds.

Proposition 3.6. There exists an FO-expressible reduction from Certainty(Q) to
Certainty(Q^u) and vice versa.

Proof. Notice that every unary relation is consistent by definition, since the only attribute
is the primary key. Let U(x) be such a unary relation and consider other appearances
of variable x in the query. Consider any atom that contains x as a variable. Then, by
Lemma 3.5, we can remove from this atom any key-group such that x assumes a value a
and a ∉ U, since no frugal repair will contain a in the answer set. After this processing
of I, U plays no role in whether a repair satisfies the query or not and hence can be
removed to obtain a query Q^{-U} without the atom U(x). Notice also that the processing
is FO-expressible. For the inverse reduction, we can add a unary relation U(x) to Q^{-U}
such that it contains any value that appears where attribute x appears. □

Next, we show how to handle the atoms where the primary key consists of all the
attributes: such an example could be R(x, y) or S(x, y, z). In the general setting, we
are given an atom of the form R(x_1, ..., x_k). Observe that the relation R will be
always consistent, since it is not possible to have any key violations. Given a query Q,
define by Q^c the query that is obtained by replacing R with k new consistent relations
R_1^c(x, x_1), ..., R_k^c(x, x_k), where x is a new variable that does not appear in Q. We can
show again the following complexity equivalence.

Proposition 3.7. There is an FO-expressible reduction from Certainty(Q) to
Certainty(Q^c) and vice versa.

Proof. To reduce Certainty(Q^c) to Certainty(Q), we simply compute R(x_1, ..., x_k)
as the natural join of the relations R_1, ..., R_k on the common variable x, where we have
projected out the joining variable x. For the inverse reduction, we populate R_1, ..., R_k by
introducing, for every tuple R(a_1, ..., a_k), k new tuples
R_1((a_1, ..., a_k), a_1), ..., R_k((a_1, ..., a_k), a_k).
It is easy to see that every R_i is a consistent relation where the variable x is the primary
key. Additionally, the two instances are equivalent w.r.t. the repairs they admit. □

It now remains to deal with the case of relations that have arity ≥ 3 and additionally
have a single variable as primary key. For this, we need the following lemma.

Lemma 3.8. Let Q be a query including an atom R(x, y_1, ..., y_k). Denote by Q^s the
query obtained by replacing R with R'(x, y), S_1^c(y, y_1), ..., S_k^c(y, y_k), where y is a new
variable. Then, there is an FO-expressible reduction from Certainty(Q) to Certainty(Q^s)
and vice versa.

Proof. For the one direction, assume we have query Q, along with a database instance
I. We transform I to an instance I^s for query Q^s as follows. For a tuple R(a, b_1, ..., b_k),
we introduce in I^s the tuple R'(a, (b_1, ..., b_k)) and also, for i = 1, ..., k, the tuples
S_i((b_1, ..., b_k), b_i). Observe that our construction guarantees that the S_i are consistent
relations. It suffices to show that I ⊨ Q iff I^s ⊨ Q^s. Notice that there is a one-to-one
correspondence between repairs of I, I^s. Indeed, if some repair r of I chooses R(a, b_1, ..., b_k),
the corresponding repair r^s of I^s will choose R'(a, (b_1, ..., b_k)) and vice versa. Now,
observe that if Q(r) evaluates to true, so will Q^s(r^s) and vice versa.

For the inverse direction, assume we have Q^s and an instance I^s. We transform I^s
to an instance I of Q by constructing R(x, y_1, ..., y_k) = R'(x, y), S_1(y, y_1), ..., S_k(y, y_k),
i.e. in order to construct R, we join all relations on y and then project out y. We will
show that I ⊨ Q iff I^s ⊨ Q^s. First, assume that I^s ⊨ Q^s; we will show that I ⊨ Q.
Indeed, consider a repair r of I and construct a repair r^s that makes the same choices as
r for all common relations between Q, Q^s and, if R(a, b_1, ..., b_k) ∈ r, then R'(a, b) ∈ r^s
for some b such that S_i(b, b_i) ∈ I^s for every i = 1, ..., k (note that by our construction
such a b always exists). Since Q^s(r^s) is true, Q(r) will be true as well.

For the inverse, assume I ⊨ Q and consider a repair r^s of I^s. Notice first that, for a
key-group R'(a, −), if R'(a, b) ∈ I^s and there exists i such that S_i(b, −) ∉ I^s, a will never
contribute towards an answer for Q^s, hence we can throw away w.l.o.g. such a key-group
from consideration. Let R'(a, −) be any key-group in I^s; equivalently, R(a, −) is a key-group
in I. Now, let R'(a, b) be the unique tuple in r^s from this key-group. As we have argued,
there exist tuples S_1(b, b_1), ..., S_k(b, b_k) in I^s (and r^s) and these tuples are unique. By our
construction, the instance I contains the tuple R(a, b_1, ..., b_k): this is the tuple that we
include in r. Since Q(r) evaluates to true, so must Q^s(r^s). □
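The two instance transformations used in the proof of Lemma 3.8 are mechanical, and a
small sketch may help to see why key-groups are preserved. The code below is illustrative
only: the names `split_atom` and `join_back`, and the encoding of the relations S_1, ..., S_k
as a dictionary indexed by i, are assumptions and not part of the paper.

```python
def split_atom(r_tuples):
    """Forward construction: R(a, b1, ..., bk) becomes R'(a, y) and S_i(y, b_i),
    where the new value y is the packed tuple (b1, ..., bk)."""
    r_prime, s = set(), {}
    for a, *bs in r_tuples:
        packed = tuple(bs)
        r_prime.add((a, packed))
        for i, b in enumerate(bs, start=1):
            s.setdefault(i, set()).add((packed, b))
    return r_prime, s

def join_back(r_prime, s):
    """Inverse construction: join R' with every S_i on y, then project y out.
    Assumes each S_i is consistent, i.e. maps every y to at most one value."""
    lookup = {i: dict(pairs) for i, pairs in s.items()}
    return {(a, *(lookup[i][y] for i in sorted(lookup)))
            for a, y in r_prime
            if all(y in lookup[i] for i in lookup)}

# Hypothetical ternary relation with key-groups R(a, -) and R(b, -).
R = {("a", 1, 2), ("a", 3, 4), ("b", 1, 2)}
r_prime, s = split_atom(R)
print(join_back(r_prime, s) == R)  # True: the round trip recovers R
```

Each tuple of R yields exactly one tuple of R' with the same key, so the key-group R(a, −)
and the key-group R'(a, −) offer the same choices to a repair, while every S_i is functional
on its key by construction.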

By combining the lemmas presented above, we can transform the initial query to a 
query Q that contains only binary atoms where exactly one attribute is the key, and each 
such binary relation may be consistent or inconsistent. From now on, we will assume that
every query will be of this form. 

The Query Graph. We now describe how to construct a directed graph G[Q] from Q,
which we will often denote simply as G if the context allows it. The vertex set V(G)
contains all the variables that appear in Q. The edge set E(G) contains, for each atom
R(u, v), a directed edge e_R = (u, v). If R is a relation in Q, we will denote by u_R and v_R
the starting and ending node of the edge e_R respectively.

Further, we will call an edge consistent (inconsistent) if the corresponding relation is
consistent (inconsistent). The set of all inconsistent edges will be denoted by E^i. We
will also use the notation x → y to mean a directed path from node x to node y where
every edge is consistent; otherwise, the directed path will be denoted by x ⇝ y. Let
us also say that a value a ∈ Dom_G(x)³ maps to a value b ∈ Dom_G(y) for a repair r over a
directed path P = R_1, ..., R_k if there are tuples R_1(a, c_1), R_2(c_1, c_2), ..., R_k(c_{k-1}, b) in
r. The notation x ↔ y will refer to an undirected path between nodes x and y.

Finally, since Q uniquely defines G[Q] and vice versa, we will often use the notation
G(r) to denote Q(r), where r is some repair.



³ Dom_G(x) denotes the set of values a for which there exists a key-group R(a, −) ∈ I, where
R(x, y) ∈ G.

Figure 1: The query graph G[Q_a] for Example 3.9. The curly edges denote inconsistent relations,
whereas the straight edges consistent relations.

Example 3.9. Consider the following query:

Q_a = R_1(x, y), T^c(y, z), S^c(u, y), R_2(u, z), R_3(u, v), U^c(u, z)

The graph G[Q_a] is depicted in Figure 1. The curly edges denote inconsistent edges
(R_1, R_2, R_3), whereas the straight edges denote consistent ones (S, T, U). Moreover,
x ⇝ z, since there is a directed path through R_1, T that goes from x to z in G[Q_a]. Also,
u → z, since the directed path through S, T has only consistent edges.



Double Paths. Nothing prevents G from being a multigraph, i.e. there may be several
edges that go from node x to node y. We will next show how to deal with this case, and
in order to do this, we discuss a more general lemma that is crucial for our results. This
lemma holds for frugal repairs only; it does not hold for any repair in general.

Lemma 3.10 (Double Paths). Consider two edge-disjoint paths P_1, P_2 of the form x ⇝ y.
Moreover, assume that there exists a value a ∈ Dom_G(x) such that there exist repairs
that map a to two distinct b_1, b_2 ∈ Dom_G(y) over path P_2. Then, for any frugal repair
r, no tuple in Q^f(r) will contain a.

Proof. For the sake of contradiction, assume that the claim does not hold. Then, there
exists a frugal repair r such that a appears in some tuple t ∈ Q^f(r); moreover, this a
maps to some b over path P_1 for r. Assume w.l.o.g. that b ≠ b_2. By our assumption,
there exists a repair r' that maps a to b_2 ≠ b over path P_2. Then, create a new repair
r'' by modifying r such that it agrees along P_2 with the choices of r' that map a to b_2.
Notice that r'' maps a to b along P_1 and a to b_2 along P_2; hence, no tuple that contains
a for the variable x can belong to Q^f(r''), whereas such a tuple belongs to Q^f(r) by our
assumption. Moreover, this modification of r does not produce any additional answers
in Q^f(r''); hence, Q^f(r'') ⊊ Q^f(r), a contradiction on the frugality of r. □

Lemma 3.10 tells us that, in the case where some value a ∈ Dom_G(x) can map to
two distinct values of Dom_G(y), a will never appear in the answers of any frugal repair.
Applying Lemma 3.5, we obtain the following corollary.

Lemma 3.11. Consider two edge-disjoint paths P_1, P_2 of the form x ⇝ y. Then, we
can assume w.l.o.g. that any repair maps a value a ∈ Dom_G(x) to a unique value
b ∈ Dom_G(y).

Example 3.12. Consider Q = R(x, z), S(x, y), T(y, z) and the database

R(a, c_1), S(a, b_1), T(b_1, c_1), S(a, b_2), T(b_2, c_2)

Q contains two edge-disjoint paths from x to z: the one is the single edge e_R, and the
other the edges e_S, e_T. Along e_R, every repair will map a to c_1. However, along the other
path and depending on the key-group S(a, −), a repair may map a to c_1 or c_2. In this
case, we can obtain an equivalent problem by throwing away the tuples S(a, −) from the
instance.
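The instance of Example 3.12 is small enough to enumerate all repairs and to observe the
behaviour predicted by Lemma 3.10. The sketch below computes the answers of the full query
in every repair and keeps the answer sets of the frugal ones; the encoding of the instance
and the helper names are assumptions made for illustration, and the key of each relation
is taken to be its first attribute.

```python
from itertools import product

def repairs(instance):
    """Yield all repairs; the key of every relation is its first attribute."""
    groups = {}
    for name, tuples in instance.items():
        for t in tuples:
            groups.setdefault((name, t[0]), []).append(t)
    keys = list(groups)
    for choice in product(*(groups[k] for k in keys)):
        repair = {name: set() for name in instance}
        for (name, _), t in zip(keys, choice):
            repair[name].add(t)
        yield repair

def full_answers(r):
    """Answers (x, y, z) of the full query for Q = R(x, z), S(x, y), T(y, z)."""
    return {(x, y, z)
            for (x, z) in r["R"]
            for (x2, y) in r["S"] if x2 == x
            for (y2, z2) in r["T"] if y2 == y and z2 == z}

def frugal_answer_sets(instance):
    """Answer sets of the frugal repairs (Definition 3.3)."""
    answers = [full_answers(r) for r in repairs(instance)]
    # A repair is frugal if no repair has a strictly smaller answer set.
    return [a for a in answers if not any(b < a for b in answers)]

instance = {
    "R": {("a", "c1")},
    "S": {("a", "b1"), ("a", "b2")},
    "T": {("b1", "c1"), ("b2", "c2")},
}
print(frugal_answer_sets(instance))  # [set()]
```

The repair that keeps S(a, b_2) produces no answer at all, so it is the only frugal repair
and the value a appears in no frugal answer, in line with Lemma 3.10. It also shows that Q
is not certain on this instance.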

Our discussion implies that, whenever we find double paths, we can obtain an equiv- 
alent instance by adding constraints to the graph. More precisely: 

Corollary 3.13. If G contains two edge-disjoint paths x ⇝ y, we can create an equivalent
instance for Certainty(Q) by adding a consistent edge R^c(x, y) to G. In the specific case
where there are multiple edges between x and y, they can all be removed and replaced by
R^c.

Example 3.14. Continuing Example 3.12, we can assume now that Q is
R^c(x, z), S(x, y), T(y, z), since we can replace R with the consistent relation R^c.

Consequently, the structure of our problem is further simplified: we can assume from 
now on w.l.o.g. that G[Q] is a simple graph, i.e. there are no multi-edges. 



4. Cycles 

In this section, we study the case where G[Q] is a directed cycle, i.e. the query is
of the form C_k = R_1(x_1, x_2), ..., R_k(x_k, x_1). We will first prove that Certainty(C_k)
can be decided in polynomial time for any k ≥ 2. Recall that [12] presents a polynomial
time algorithm for the query C_2; our algorithm is a generalization of Wijsen's algorithm.
Then, we will show that one can not only solve the problem efficiently, but one can
represent the set of all possible answers for the full query C_k concisely for frugal repairs.
The latter fact will be important when we describe the PTIME algorithm for strongly
connected graphs in Section 5.



4.1. A PTIME Algorithm for Cycles 

Our running example for this section will be the cycle query C_3 = R(x, y), S(y, z), T(z, x)
and the inconsistent instance will be the one presented in Figure 2. We start by defining
the notion of a partial repair.

Definition 4.1 (Partial Repair). Let I be an inconsistent database. A subset r_p ⊆ I is
called a partial repair for a boolean query Q if:

• r_p is consistent

• r_p is closed for Q: if R(a, b) ∈ r_p, where e_R = (u, v), and v has an outgoing edge
e_S = (v, w), either S(b, c) ∈ r_p for some c, or there exists no key-group S(b, −) in
I.

Figure 2: An inconsistent instance I for the cycle query C_3.

In our example, the subset {R(a_5, b_5), S(b_5, c_5), T(c_5, a_5)} is a partial repair. On the
other hand, the subset {R(a_3, b_3), S(b_3, c_3), T(c_3, a_4)} of I is not, since, although it is
consistent, it is not closed for C_3. Indeed, there exists a key-group R(a_4, −) and the
subset contains no tuple from this key-group.
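The two conditions of Definition 4.1 can be checked mechanically. The sketch below does so
for binary atoms whose key is the first attribute; the function name and the encoding are
hypothetical, and the instance is a made-up stand-in that merely contains the tuples
mentioned in the text, not the instance of Figure 2.

```python
def is_partial_repair(atoms, instance, subset):
    """Check Definition 4.1 for binary atoms whose key is the first attribute.

    atoms: list of (relation, u, v) triples, one per edge e_R = (u, v).
    instance, subset: dicts mapping a relation name to a set of pairs.
    """
    for name, tuples in subset.items():
        if not tuples <= instance[name]:
            return False                      # not a subset of I
        keys = [a for a, _ in tuples]
        if len(keys) != len(set(keys)):
            return False                      # two key-equal tuples
    for name, _, v in atoms:
        for succ, u2, _ in atoms:
            if u2 != v:
                continue                      # e_succ does not leave node v
            succ_keys = {a for a, _ in subset.get(succ, ())}
            inst_keys = {a for a, _ in instance[succ]}
            for _, b in subset.get(name, ()):
                if b in inst_keys and b not in succ_keys:
                    return False              # key-group succ(b, -) left untouched
    return True

C3 = [("R", "x", "y"), ("S", "y", "z"), ("T", "z", "x")]
# Hypothetical instance containing the tuples named in the running example.
I = {
    "R": {("a3", "b3"), ("a4", "b4"), ("a5", "b5")},
    "S": {("b3", "c3"), ("b4", "c4"), ("b5", "c5")},
    "T": {("c3", "a4"), ("c4", "a3"), ("c5", "a5")},
}
closed = {"R": {("a5", "b5")}, "S": {("b5", "c5")}, "T": {("c5", "a5")}}
not_closed = {"R": {("a3", "b3")}, "S": {("b3", "c3")}, "T": {("c3", "a4")}}
print(is_partial_repair(C3, I, closed), is_partial_repair(C3, I, not_closed))  # True False
```

The second subset fails because T(c3, a4) leads to the value a4, whose key-group R(a4, −)
exists in the instance but contributes no tuple to the subset.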

For a partial repair r_p, let I − r_p be the instance obtained by removing from I all
the key-groups R(a, −) for which there exists some b such that R(a, b) ∈ r_p. Note that
a normal repair r is always a partial repair. We can now prove the following lemma.
Although this section focuses on cycles, we prove a more general version that holds for
any strongly connected graph.

Lemma 4.2. Let G[Q] be a strongly connected graph. Let r_p be a partial repair of I such
that Q(r_p) is false and consider I' = I − r_p. Then, I ⊨ Q iff I' ⊨ Q.

Proof. We first show that I' ⊨ Q implies I ⊨ Q. Consider any repair r of I. Then, define
r' = r − r_p ⊆ r (where r − r_p does not mean set difference, but removing from r all keys
in r_p); notice that r' is a repair of I'. Since I' ⊨ Q, Q(r') is true. By the monotonicity
of Q, Q(r) will also evaluate to true.

For the inverse direction, we show that I ⊨ Q implies I' ⊨ Q. For the sake of
contradiction, suppose that I' ⊭ Q. Then, there exists a repair r' ⊆ I' that does not
satisfy Q. Let r = r' ∪ r_p. Observe that r is a repair for I; hence, Q(r) evaluates to true.

Then, there exists a valuation that satisfies Q: let T = {t_1, ..., t_k} be the tuples
from R_1, ..., R_k that correspond to this valuation (assume that Q has k atoms). Clearly,
not all tuples from T can belong in r' (since then r' would satisfy Q) or r_p (since
then Q(r_p) would be true). This implies that T contains a tuple T_0(a, b_0) ∈ r_p and a
tuple T_m(b_{m-1}, c) ∈ r' (it may be that T_0 = T_m). Since G[Q] is strongly connected, there
exists a path in G[Q] from u_{T_0} to u_{T_m}. Moreover, Q(r) is true: hence, there will be tuples
T_1(b_0, b_1), T_2(b_1, b_2), ..., T_{m-1}(b_{m-2}, b_{m-1}) in the set T. Consider the first index i from
1, ..., m such that T_i(b_{i-1}, b_i) ∉ r_p (with b_m = c). Then, b_{i-1} has a key-group
T_i(b_{i-1}, −) in I, but r_p does not contain a tuple from the key-group. Since
T_{i-1}(b_{i-2}, b_{i-1}) ∈ r_p (with b_{-1} = a), this contradicts
the closed condition for the partial repair r_p. □

In our example, the partial repair

r_34 = {R(a_3, b_3), S(b_3, c_3), T(c_3, a_4), R(a_4, b_4), S(b_4, c_4), T(c_4, a_3)}

does not satisfy C_3. Lemma 4.2 tells us that, if we can find a partial repair that does
not satisfy Q, we can equivalently solve the problem for a database instance of strictly
smaller size. But how do we compute such partial repairs? We discuss an algorithm
that computes partial repairs for the cyclic queries C_k. The algorithm below is only for
cycles, and does not extend obviously to strongly connected graphs.

Let us construct a directed graph F_Q(I) as follows: introduce a node for every value
that appears in the instance I and construct an edge (a, b) whenever there exists a tuple
R(a, b) ∈ I. We can now prove the following lemma.

Lemma 4.3. Consider a cyclic query C_k and an instance I. Define a reduction of I to
I' as follows:

1. If R(a, b) does not belong in any directed cycle in F_{C_k}(I), I' = I − R(a, −).

2. If C is a simple directed cycle of length > k in F_{C_k}(I),
I' = I − {R(a, −) | ∃b : R(a, b) ∈ C}.

Then, I' ⊨ C_k iff I ⊨ C_k.

Proof. For case (1), notice that any frugal repair will choose the tuple R(a, b) from the
key-group R(a, −) w.l.o.g., since this will never result in a query answer that contains
a as a value. Hence, removing the key-group R(a, −) will create an equivalent instance
w.r.t. certainty for C_k.

For case (2), it suffices to show that the cycle C corresponds to a partial repair of I:
then, we can apply Lemma 4.2 to show the equivalence for I, I'. Indeed, since the cycle
is simple, i.e. contains no node twice, it is key-consistent. Moreover, it cannot satisfy
C_k, since then it would contain a cycle of length k (a contradiction because the cycle is
simple). Finally, notice that for each tuple R_i(a, b), there exists a tuple R_{i+1}(b, c) for
some c, since the edge R_i(a, b) belongs in a cycle. □

Note that one can detect whether there exists (and also find if it exists) a directed
cycle of length > k in polynomial time for a fixed k. Indeed, consider all the directed
paths p : a → b in graph F_{C_k}(I) of length exactly k. There are polynomially many such
paths, since k is fixed. For each path p, consider the graph F_{C_k}(I) − p, where we remove
all nodes of p except for the endpoints. If there exists a path from b to a in F_{C_k}(I) − p,
there exists a cycle of length > k in F_{C_k}(I). Otherwise, no such cycle exists. Thus, the
reduction defined in Lemma 4.3 is always in polynomial time.
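
The detection procedure just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the graph is given as a list of (a, b) pairs, and the function name find_long_cycle is hypothetical, not part of the paper. It enumerates simple paths with exactly k edges and then searches for a way back from b to a that avoids the interior of the path.

```python
from collections import deque

def find_long_cycle(edges, k):
    """Return a simple directed cycle (list of nodes) with more than k edges, or None."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set())

    def simple_paths(path):
        # Enumerate simple paths with exactly k edges; for fixed k there are
        # polynomially many of them.
        if len(path) == k + 1:
            yield list(path)
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:
                path.append(nxt)
                yield from simple_paths(path)
                path.pop()

    for start in adj:
        for p in simple_paths([start]):
            a, b = p[0], p[-1]
            blocked = set(p[1:-1])           # interior nodes of p are removed
            parent, queue = {b: None}, deque([b])
            while queue:                     # BFS from b back towards a
                x = queue.popleft()
                if x == a:
                    break
                for y in adj[x]:
                    if y not in parent and y not in blocked:
                        parent[y] = x
                        queue.append(y)
            if a in parent:
                back, x = [], a
                while x is not None:         # walk parents from a up to b
                    back.append(x)
                    x = parent[x]
                back.reverse()               # now b, ..., a
                return p + back[1:-1]        # nodes of the cycle, starting at a
    return None
```

The returned list contains each node of the cycle once; the closing edge goes from its last node back to its first.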

Definition 4.4 (Reduced Instance). I_0 is reduced for C_k if every edge belongs to a cycle
in F_{C_k}(I_0), and F_{C_k}(I_0) has no cycles of length > k.

Lemma 4.3 implies that, for every instance I, one can find a reduced instance I_0 such
that I ⊨ C_k if and only if I_0 ⊨ C_k. Since each reduction, as described in Lemma 4.3, is in
PTIME and removes at least one key-group, we can always find I_0 in polynomial time.
Moreover:

Lemma 4.5. Let I_0 be a reduced instance. Then, I_0 ⊨ C_k if and only if F_{C_k}(I_0) has a
directed cycle of length k.

Proof. For the one direction, assume that F_{C_k}(I_0) contains at least one directed cycle of
length k (note that the graph cannot have directed cycles of length < k) and consider
any repair r_0 of I_0. Let R_i(a, b) be a tuple in r_0. Since b is not a dead-end node, there
exists some c such that R_{i+1}(b, c) ∈ r_0, and so on. Note that, since the nodes of the
graph are finitely many, at some point in this process we will meet again the same value.
This will introduce a cycle: the cycle will be of length exactly k (otherwise it would have
been removed by the algorithm), hence satisfying the query.

For the inverse direction, if F_{C_k}(I_0) contains no cycle of length k, then trivially no
repair of I_0 can satisfy C_k. Thus, I_0 ⊭ C_k. □

Figure 3: The resulting instance I_0 and the corresponding graph F_{C_3}(I_0) for the instance in Figure 2.

Our algorithm for Certainty(C_k) can be summarized as follows. First, starting
from I, we construct in polynomial time a reduced instance I_0 by using the reductions
in Lemma 4.3; we know that I ⊨ C_k iff I_0 ⊨ C_k. Then, we simply check whether F_{C_k}(I_0)
contains any cycles. If yes, I ⊨ C_k; otherwise I ⊭ C_k.
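
On small instances, the outcome of this procedure can be cross-checked directly against the definition of Certainty(C_k). The sketch below is not the polynomial-time algorithm: it enumerates every repair (one tuple per key-group) and tests whether each one contains a cycle of length k, so it takes exponential time. The encoding of an instance as a dictionary from relation names to sets of (key, value) pairs, and the function names, are our own assumptions.

```python
from itertools import product

def satisfies_cycle(repair, query):
    """repair: dict (relation, key) -> value, with one entry per key-group."""
    for (name, start) in repair:
        if name != query[0]:
            continue
        x, alive = start, True
        for rel in query:                  # follow R_1, R_2, ..., R_k in order
            if (rel, x) not in repair:
                alive = False
                break
            x = repair[(rel, x)]
        if alive and x == start:           # closed a cycle of length k
            return True
    return False

def certain_brute_force(instance, query):
    """True iff every repair of `instance` satisfies the cycle query `query`."""
    groups = {}
    for name in query:
        for a, b in instance[name]:
            groups.setdefault((name, a), []).append(b)
    keys = list(groups)
    # A repair picks exactly one value from every key-group.
    for choice in product(*(groups[key] for key in keys)):
        if not satisfies_cycle(dict(zip(keys, choice)), query):
            return False
    return True
```

For example, with query ["R", "S", "T"] and an instance in which both choices for the key-group R(a1, -) lead back to a1, the function returns True; if one choice leads to a dead end, it returns False.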

The next example describes how the algorithm is executed for query C_3 and the
instance in Figure 2.

Example 4.6. In the first step, we find the partial repair r_34, which is a cycle of length 6
(pattern 2). Then, we create I' by removing all key-groups R(a_3, -), S(b_3, -), T(c_3, -),
as well as R(a_4, -), S(b_4, -), T(c_4, -). In the second step, notice that the edge S(b_5, c_3)
has no more outgoing edges, since the key-group T(c_3, -) has been removed. Thus, we
can construct I'' by removing the key-group S(b_5, -), which has two tuples S(b_5, c_3) and
S(b_5, c_5). For the third step, the edge R(a_5, b_5) no longer has an outgoing edge, so it can be
removed. The fourth and final step removes also the edge T(c_5, a_5).

Notice that the remaining instance I_0 (see Figure 3) contains three cycles in F_{C_3}(I_0):
a_1 - b_1 - c_1, a_1 - b_2 - c_1 and a_2 - b_2 - c_2. Thus, we can conclude that I ⊨ C_3.

We have now proved the main theorem of this subsection. 

Theorem 4.7. For a cycle query C_k, where k > 1, Certainty(C_k) is in P.

Recall that [12] proves that Certainty(C_2) is not FO-expressible. It is easy to
see that this can be generalized by the following claim:

Proposition 4.8. For a cycle query C_k (k > 1), Certainty(C_k) is FO-expressible if
and only if C_k contains at most one inconsistent edge.

4.2. Structural Properties

In the previous subsection, we presented an algorithm that computes whether I ⊨ C_k.
If I ⊭ C_k, there is no frugal repair of I that satisfies C_k. On the other hand, if I ⊨ C_k,
every frugal repair will satisfy C_k. If C_k^f is the full query that corresponds to the boolean
query C_k, we are interested in representing all possible answers C_k^f(r) for every frugal
repair r. Let M(I) be the set of all frugal repairs. In order to represent all possible
answers, we will use or-set notation, as presented in [8]. An or-set, denoted as ⟨1, 2, 3⟩,
conceptually means that the output is either 1, 2 or 3. Based on this notation, we can
define a possible answer set of a frugal repair as an or-set over all C_k^f(r): A_{C_k}(I) =
⟨C_k^f(r) | r ∈ M(I)⟩. Nevertheless, observe that this representation may be prohibitive to
store, since there may exist exponentially many frugal repairs.

Notice that, for every query Q, the set of all answers Q(r) is of the type ⟨{T_1 ×
... × T_k}⟩ = ⟨{T}⟩, where T = T_1 × ... × T_k. When we restrict to frugal repairs, the
set A_Q has the same type, but the sets are "independent", in the sense that no two
sets in the or-set A_Q are contained within each other. Now, we consider a different
type of representation, {⟨T⟩}, which is a set of or-sets. [H] defines the α operator, which
maps a type {⟨T⟩} to ⟨{T}⟩. For example, α{⟨1, 2⟩, ⟨3, 4⟩} = ⟨{1, 3}, {1, 4}, {2, 3}, {2, 4}⟩.
In general, applying α to a set of or-sets will not result in an independent or-set, e.g.
α({⟨1, 2⟩, ⟨1, 3⟩}) = ⟨{1}, {1, 3}, {1, 2}, {2, 3}⟩. However, if the or-sets are disjoint, it will
always be independent. Thus, our goal is to represent the possible query answers as a set
of or-sets, i.e. A_{C_k}(I) = {O_1, ..., O_k}, of type {⟨T⟩}, where the or-sets O_1, ..., O_k are
mutually disjoint. This representation is not always possible to construct: for example,
the set A' = ⟨{1, 2}, {1, 3}, {2, 2}⟩ can not be written in such a form. We will show,
however, that for cyclic queries it is always possible to construct such a polynomially
large representation.
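
The two α examples above can be reproduced with a short Python sketch. The encoding is our own assumption: an or-set is a tuple of alternatives, and the result of α is a Python set of frozensets; the helper is_independent checks that no member is strictly contained in another.

```python
from itertools import product

def alpha(or_sets):
    """Map a set of or-sets (type {<T>}) to an or-set of sets (type <{T}>)."""
    # Pick one alternative from every or-set; collapse each pick into a set.
    return {frozenset(choice) for choice in product(*or_sets)}

def is_independent(or_set_of_sets):
    """True iff no set in the or-set is strictly contained in another."""
    return not any(s < t for s in or_set_of_sets for t in or_set_of_sets)

disjoint = alpha([(1, 2), (3, 4)])       # the first example in the text
overlapping = alpha([(1, 2), (1, 3)])    # the second example in the text
```

Here is_independent(disjoint) holds, while is_independent(overlapping) fails because {1} is contained in {1, 3}.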

Recall from the previous section that the first part of our PTIME algorithm creates
an equivalent reduced instance I_0 of I. We can thus only study frugal repairs for the
instance I_0. Further, recall that F_{C_k}(I_0) contains only directed cycles of length exactly
k, and every node in the graph belongs in a directed cycle. We next describe the steps
of our algorithm.

We first claim that the graph F_{C_k}(I_0) is a collection of disjoint strongly
connected components (SCCs).

Lemma 4.9. If I_0 is a reduced instance of I, the graph F_{C_k}(I_0) is a collection of disjoint
SCCs.

Proof. Consider two nodes u, v in the graph. We will show that either there exists a
directed path from u to v or the two nodes are completely disconnected (in the sense
that there is no undirected path connecting them). Assume that there is an undirected
path P : u = w_1, ..., w_ℓ = v. Take any pair of consecutive nodes w_i, w_{i+1}. Then, there
exists an edge (w_i, w_{i+1}) or (w_{i+1}, w_i). However, since I_0 is reduced, every edge
belongs in a cycle. Thus, for edge (w_i, w_{i+1}), there exists a directed path w_{i+1} → w_i, and
similarly for the other case. This implies that u, v are also connected through a directed
path. □

Consequently, the graph F_{C_k}(I_0) can be described as a collection of disjoint strongly
connected components A_1, ..., A_m. It is now easy to see that any frugal repair can
independently decide how to "repair" each such SCC. For each SCC A, we now introduce
an or-set O_A in A_{C_k}(I), defined as follows. We consider the query C_k^f without any key
constraints and evaluate this over the SCC A (which is a subinstance of I): let t_1, ..., t_ℓ
denote the answers. Then, O_A = ⟨t_1, ..., t_ℓ⟩.

Example 4.10. In our running example (see Figure 3), the reduced instance I_0 contains
exactly one connected component, which has three cycles. Hence, the set A_{C_3} contains
exactly one or-set:

A_{C_3}(I_0) = {⟨(a_1, b_1, c_1), (a_1, b_2, c_1), (a_2, b_2, c_2)⟩}
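
A minimal Python sketch of this construction follows, assuming the input is already a reduced instance, so that by Lemma 4.9 the SCCs coincide with the connected components. The instance encoding (relation name to set of pairs), the function names, and the concrete tuple set below are our own assumptions; the tuples are chosen to be consistent with the three cycles of Example 4.10, not copied from Figure 3.

```python
def answers(instance, query):
    """All answers of the full cycle query, ignoring key constraints."""
    result = set()
    for start, _ in instance[query[0]]:
        partial = [(start,)]
        for rel in query:                      # extend along R_1, ..., R_k
            partial = [t + (b,) for t in partial
                       for a, b in instance[rel] if a == t[-1]]
        result.update(t[:-1] for t in partial if t[-1] == start)
    return result

def or_sets_by_component(instance, query):
    """Group the answers by connected component of the value graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]      # path halving
            x = parent[x]
        return x

    for rel in query:
        for a, b in instance[rel]:
            parent[find(a)] = find(b)
    groups = {}
    for t in answers(instance, query):
        groups.setdefault(find(t[0]), set()).add(t)
    return {frozenset(g) for g in groups.values()}

I0 = {
    "R": {("a1", "b1"), ("a1", "b2"), ("a2", "b2")},
    "S": {("b1", "c1"), ("b2", "c1"), ("b2", "c2")},
    "T": {("c1", "a1"), ("c2", "a2")},
}
result = or_sets_by_component(I0, ["R", "S", "T"])
```

On this instance the result is a single or-set containing the three answers listed in Example 4.10.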

The following lemma shows that, for each SCC A in the graph F_{C_k}(I), a frugal repair
must contain as an answer from A exactly one of the cycles appearing in A, hence proving
the validity of our construction.

Lemma 4.11. For a SCC A in the graph F_{C_k}(I), every frugal repair r contains exactly
one query answer t ∈ C_k^f(r) such that the cycle corresponding to t is in A.

Proof. In order to show the lemma, for each directed cycle of length k in A, we will
construct a repair of A that contains only this cycle as an answer to C_k. This proves the
claim, since then any repair of A with at least 2 answers will not be frugal.

Indeed, consider a cycle C that contains the nodes a_1, ..., a_k. This defines a choice
of tuple for each key group for a_1, ..., a_k. Note that the subset of I that corresponds to
C is a partial repair. We will show how to construct, starting from G_0 = C, a subgraph
G_ℓ of A in a sequence of steps, such that for each step i, the following invariant is true:
G_i corresponds to a partial repair r_p and C_k^f(r_p) contains only the cycle C as an answer.

The invariant trivially holds for G_0 = C. Now, consider G_i. Since A is strongly
connected, if G_i does not include every node of A (in this case, the construction has
finished), there will be an edge R(a, b) ∈ E(A) such that a ∈ V(G_i) and b ∈ V(A) \ V(G_i).
Construct G_{i+1} by simply adding (a, b) (and of course node a) to G_i. Clearly, G_{i+1} is
key-consistent. Moreover, it is trivially closed, since G_i is closed and the only edge added,
(a, b), has an endpoint that has an outgoing edge. Finally, we have to show that G_{i+1}
contains only C as an answer to C_k; in other words, G_{i+1} has exactly one cycle. Suppose
not; then, the addition of (a, b) must have created a new cycle that goes through (a, b).
In this case, G_i would have to contain an edge S(c, b) for some node c ∈ V(G_i). But
then, G_i would not be closed, a contradiction. □

To sum up our discussion, we have constructed a representation of A_{C_k}(I) as a set
of or-sets: A_{C_k}(I) = {O_A | A is a SCC of F_{C_k}(I)}. Moreover, following our construction
the sets O_A will be completely disjoint: no value will appear in two different or-sets. We
have thus shown:

Theorem 4.12. The or-set of possible answers for all frugal repairs A_{C_k}(I) can be
represented as a set of or-sets of polynomial size, where no two or-sets contain a shared
value.

5. Strongly Connected Graphs 

In this section, we will show how to compute Certainty(Q) when G[Q] is a strongly
connected graph (SCG). The idea of the algorithm is that every strongly connected graph
can be decomposed into a collection of (possibly overlapping) cycles. In the previous
section, we discussed not only how to solve the problem for cycles in polynomial time,
but also how to compute the set of all possible answers for the corresponding full query
as a set of or-sets.

As a first step, we will describe a decomposition of G that will be suitable for our
approach. In particular, we will show that one can build G bottom-up in a sequence of
steps, where each step combines a SCG subgraph G_s of G with a cycle C such that G_s
and C overlap over consecutive edges of C or a single node.

Lemma 5.1 (SCG Decomposition). Every SCG G can be built in a sequence of steps
G_0, ..., G_m, such that each step G_i is a strongly connected graph and G_{i+1} = G_i ∪ C,
where C is a cycle that overlaps with G_i only on consecutive edges or a single node.

Proof. For the base of the induction, since G is strongly connected, it contains some
simple cycle C_0: let G_0 = C_0. Next, consider a SCG G_i that is a subgraph of G. If
G_i = G, our construction has ended. Otherwise, there must exist some edge e = (u, v) ∈
E(G) \ E(G_i). Since G is connected, we can strengthen this statement by claiming that
there exists such an e that has at least one endpoint in G_i.

Now, since G is strongly connected, there must be a simple path P_vu from v to u in
G. Consider the cycle C = P_vu ∪ {(u, v)} and notice that C overlaps with G_i at least
at one point (since some of u, v belongs in V(G_i)): it may overlap with G_i in several
non-consecutive segments. However, one can always find a segment C_st of C such that
only its endpoints s, t belong in G_i.

If s = t, the cycle overlaps with G_i only in node s and we are done (since C is a simple
cycle, i.e. a node appears only once). Otherwise, since G_i is also strongly connected,
there exists a simple path P_ts in G_i that connects t with s. To conclude the proof,
consider the cycle C' = C_st ∪ P_ts. By our construction, C' overlaps with G_i only on the
consecutive edges of the path P_ts. □

Example 5.2. Consider the following query Q:

Q = R(x, y), S(y, z), T(z, x), U(x, z), W(y, t), V(t, z)

A possible decomposition of G[Q] according to Lemma 5.1 would be as follows. Let
G_0 = R(x, y), S(y, z), T(z, x), which is a cycle of length 3. Then, G_1 adds to G_0 the
2-cycle C_1 = T(z, x), U(x, z): notice that C_1 and G_0 overlap only on the edge T(z, x).
Finally, G_2 = G_1 ∪ C_2, where C_2 is the cycle R(x, y), W(y, t), V(t, z), T(z, x): again, the
overlapping part between G_1 and C_2 is the cycle segment T(z, x), R(x, y).
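
The decomposition in Example 5.2 can be checked mechanically. The following Python sketch (our own illustration; the helper names are hypothetical) verifies that C_1 and C_2 are cycles, that every intermediate graph is strongly connected, and that the last step covers all of G[Q].

```python
Q = {"R": ("x", "y"), "S": ("y", "z"), "T": ("z", "x"),
     "U": ("x", "z"), "W": ("y", "t"), "V": ("t", "z")}

def is_cycle(names):
    """Each edge must end where the next one starts, cyclically."""
    return all(Q[names[i]][1] == Q[names[(i + 1) % len(names)]][0]
               for i in range(len(names)))

def strongly_connected(names):
    edges = [Q[n] for n in names]
    nodes = {x for e in edges for x in e}

    def reach(src, pairs):
        seen, stack = {src}, [src]
        while stack:
            x = stack.pop()
            for a, b in pairs:
                if a == x and b not in seen:
                    seen.add(b)
                    stack.append(b)
        return seen

    root = next(iter(nodes))
    reversed_edges = [(b, a) for a, b in edges]
    # Strongly connected iff the root reaches, and is reached by, every node.
    return reach(root, edges) == nodes and reach(root, reversed_edges) == nodes

G0 = ["R", "S", "T"]
C1 = ["T", "U"]
C2 = ["R", "W", "V", "T"]
steps = [set(G0), set(G0) | set(C1), set(G0) | set(C1) | set(C2)]
checks = [is_cycle(G0), is_cycle(C1), is_cycle(C2)] + \
         [strongly_connected(sorted(s)) for s in steps]
```

All entries of checks evaluate to True, and the final step equals the full edge set of Q.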

We will exploit this decomposition of G to compute the set A_G(I) inductively. The
base case of the induction is a cycle C_0, and in the previous section we have shown
how to compute the set A_{C_0}(I). Assume now that we have computed the set A_{G_i}(I)
for a strongly connected subgraph G_i. We next show how to compute A_{G_{i+1}}(I), where
G_{i+1} = G_i ∪ C for a cycle C. We distinguish two cases, depending on whether u = v or
not.

Different endpoints (u ≠ v). Let C_vu denote the segment of the cycle that connects v to
u. Then:

Lemma 5.3. For every frugal repair of I, any b ∈ Dom_G(v) maps to a unique m(b) ∈
Dom_G(u) along C_vu.

Proof. By our construction, no edge in C_vu belongs in G_i. Additionally, since both u, v
belong in V(G_i), there must exist a path P_vu that connects v to u and completely resides
inside G_i. Hence, C_vu, P_vu are two edge-disjoint paths that connect v to u. Applying
Lemma 3.11 proves our lemma. □

By the inductive step, we can compute A_{G_i}(I) as a set of or-sets of polynomial size.
Consider an or-set A ∈ A_{G_i}(I). For any tuple t ∈ A, let t_u, t_v be its values for the u, v
variables respectively. We now distinguish two cases:

1. There exists a tuple t ∈ A such that t_v does not map to t_u through C_vu. Then,
no frugal repair of G_{i+1} can contain any answer from A, since one would always
choose (independently from the other or-sets) tuple t from A: since t_v maps to
some m(t_v) ≠ t_u, t would never be part of an answer for G_{i+1}.

2. For any tuple t ∈ A, m(t_v) = t_u. In this case, any choice from the or-set A would
result in an answer for G_{i+1}.

Hence, we can filter the set A_{G_i}(I) to obtain a set B_{G_{i+1}}(I) that describes the possible
answers for the frugal repairs of the query that corresponds to G_{i+1}, but has as head
variables only the ones in V(G_i). If V(G_{i+1}) = V(G_i), A_{G_{i+1}}(I) = B_{G_{i+1}}(I). Otherwise,
we need to show how to expand B_{G_{i+1}} to A_{G_{i+1}} so as to take into account the nodes of
V(G_{i+1}) \ V(G_i).

First, we will show that the choices for the edges in C_vu are independent for each
or-set B ∈ B_{G_{i+1}}(I). Indeed, consider a value a ∈ Dom_G(u) (which will only appear in
B). Let M(a) be all values in ⋃_{w ∈ V(C_vu)} Dom_G(w) that map to a. Notice that for a ≠ a',
M(a) ∩ M(a') = ∅. Hence, the choices of a frugal repair for an edge in C_vu will relate to
exactly one or-set in B_{G_{i+1}}.

For the final step, consider any tuple t ∈ B. Consider two distinct paths P_1, P_2
along C_vu that map t_v to t_u. We will show that any repair can contain at most one
of those. Indeed, suppose that both exist and consider the first edge e_R of C_vu where
P_1, P_2 diverge. Then, the repair must contain some R(c, d_1), R(c, d_2), where d_1 ≠ d_2,
which would contradict the fact that it is key-consistent. Hence, the paths are mutually
exclusive. This implies that, for each tuple t in the or-set, a frugal repair can choose
exactly one path from t_v to t_u.

Hence, we construct A_{G_{i+1}}(I) from B_{G_{i+1}}(I) as follows. For each or-set B ∈ B_{G_{i+1}}(I),
we construct a corresponding or-set A_B ∈ A_{G_{i+1}}(I). For each tuple t ∈ B, compute the
set S_vu of the tuples returned by the query Q(w_1, ..., w_k) = C_vu(v, w_1, ..., w_k, u), v = t_v, u = t_u. These
correspond to the paths that map from t_v to t_u along the edges of C_vu. Then, add to
A_B the tuples in {t : t' | t' ∈ S_vu}, where the notation t : t' denotes the concatenation
of the two tuples.

Same endpoints (u = v). In this case, the cycle C overlaps with G_i only on node v.
Hence, the choices for C, G_i are independent. By induction, we have computed A_{G_i}(I)
and we can also compute A_C(I), since C is a cycle. We next show how to combine the
two sets to compute A_{G_{i+1}}(I).

We will instead present an algorithm for a more general case. Let G = G_1 ∪ G_2, where
G_1, G_2 are edge-disjoint graphs but join on the nodes v_1, ..., v_m. Additionally, assume
that, for an instance I of G, the sets A_1 = A_{G_1}(I), A_2 = A_{G_2}(I) are representable. We
will then show how to compute A_G(I) from A_1, A_2.

For two or-sets O_1, O_2 in A_1, A_2 respectively, define O_1 ⋈ O_2 as the or-set that
contains all tuples t that result from joining some t_1 ∈ O_1, t_2 ∈ O_2. If t_1, t_2 join on
a ∈ Dom_G(v_1) × ... × Dom_G(v_m), we say that O_1, O_2 are a-joinable.

We will reduce the problem of combining A_1, A_2 to the problem of computing A_{C_2}(I_C)
for an appropriately constructed instance I_C. Recall that C_2 = R(x, y), S(y, x).

Lemma 5.4. The problem of computing A_G(I) from A_1 = A_{G_1}(I), A_2 = A_{G_2}(I) can be
polynomially reduced to computing the set A_{C_2}.

Proof. Before we describe our reduction, we first simplify the problem by making two
observations.

First, consider some or-set O ∈ A_1 (similarly for A_2) that contains a ∈ Dom_G(v_1) ×
... × Dom_G(v_m), but is not a-joinable with any or-set from A_2. Then, no frugal repair
for I can contain an answer with a. This implies that we can w.l.o.g. throw away O
from A_1 (and similarly for A_2).

Second, consider two or-sets O_1 ∈ A_1, O_2 ∈ A_2 such that they are both a-joinable and
a'-joinable, where a ≠ a'. Again, both or-sets can be thrown away w.l.o.g., since no frugal
repair will ever produce an answer from these or-sets: indeed, a frugal repair can always make
O_1 choose the one value and O_2 choose the other value.

We have thus come up with an instance where for every value a ∈ Dom_G(v_1) × ... ×
Dom_G(v_m) contained in A_1, A_2, there exist two unique or-sets O_1^a ∈ A_1 and O_2^a ∈ A_2
that are a-joinable. Additionally, O_1^a, O_2^a can only be a-joinable.

We now show how to create an instance I_C for the query C_2. For each such value
a, we add the tuples R(O_1^a, O_2^a) and S(O_2^a, O_1^a) in I_C. As we have discussed, we can
compute A_{C_2}(I_C). For an or-set A = ⟨(O_1^1, O_2^1), ..., (O_1^ℓ, O_2^ℓ)⟩ ∈ A_{C_2}(I_C), let t(A) =
⋃_{j=1}^{ℓ} O_1^j ⋈ O_2^j. We claim that A_G(I) = {t(A) | A ∈ A_{C_2}(I_C)}.

Indeed, notice that the set {t(A) | A ∈ A_{C_2}(I_C)} contains disjoint or-sets. Hence,
it suffices to show that the set of answers G(r) of a frugal repair r of G belongs in
{t(A) | A ∈ A_{C_2}(I_C)}. However, since G_1, G_2 are edge-disjoint, r can be decomposed
as the union of frugal repairs r_1, r_2 from A_1, A_2 respectively. The crucial observation is
that r_1, r_2 correspond to a unique repair r_C of I_C. This repair r_C must be frugal; hence,
G(r) will indeed belong to the set we have constructed. □

Summing up, we can construct A_G(I) as a set of or-sets of polynomial size, by using
an appropriate decomposition of the query graph. If A_G(I) is empty, this implies that
no frugal repair satisfies G; else, we know that every repair will satisfy G. We have thus
proved the main theorem of this section.

Theorem 5.5. The problem Certainty(Q) is in P when G[Q] is a strongly connected
graph. Moreover, the set A_Q(I) can be represented as a set of or-sets of polynomial size.

6. The Dichotomy Theorem 

In this section, we prove that there exists a dichotomy on the complexity of
Certainty(Q) into PTIME and coNP-complete. We present a necessary and sufficient
condition that captures this dichotomy, which we call unsplittability and which depends on
structural properties of the graph G[Q].

6.1. Notations 

Let G_s be any subgraph of G. For G_s, let E^i(G_s) be the set of inconsistent edges
that appear in G_s. Writing u_R → v for a consistent directed path and u_R ⇝ v for a
possibly inconsistent directed path, we define the following sets of nodes.

Reach(G_s) = {v ∈ V(G) | ∀R ∈ E^i(G_s) : u_R → v in G}

fd(G_s) = {v ∈ V(G) | ∀R ∈ E^i(G_s) : u_R ⇝ v in G − {e_R}}

In the specific case where the subgraph consists of a single inconsistent edge e_R, we
will use the notation Reach(R) and fd(R) respectively. In this case, Reach(R) is the set
of nodes that are reachable from the source node u_R through a consistent directed path.
Similarly, fd(R) denotes the set of nodes that are reachable from u_R through a possibly
inconsistent directed path that does not go through e_R. Formally:

Reach(R) = {v ∈ V(G) | u_R → v in G}

fd(R) = {v ∈ V(G) | u_R ⇝ v in G − {e_R}}

Both sets fd and Reach will play a crucial role in proving the dichotomy result. As 
a warmup exercise, it is easy to show that they are related as follows. 

Proposition 6.1. For a subgraph G_s of G, Reach(G_s) ⊆ fd(G_s).

Proof. Indeed, if for some inconsistent edge e_R ∈ E^i(G_s) there exists a consistent path
u_R → v, the path can not go through e_R, since e_R is inconsistent. □
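
For a single inconsistent edge, both sets reduce to plain graph reachability, as the following Python sketch illustrates. The graph encoding (edge name to a triple of source, target and a consistency flag), the function names and the small example graph are our own assumptions, chosen only to exercise the definitions.

```python
def reachable(src, edges):
    """Nodes reachable from src along directed edges (src itself included)."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, []).append(b)
    seen, stack = {src}, [src]
    while stack:
        x = stack.pop()
        for y in adj.get(x, ()):
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return seen

def reach_and_fd(graph, name):
    """graph: dict edge name -> (source, target, is_consistent)."""
    u = graph[name][0]
    # Reach(R): follow consistent edges only.
    consistent = [(a, b) for a, b, ok in graph.values() if ok]
    # fd(R): follow any edge except e_R itself.
    without_e = [(a, b) for n, (a, b, _) in graph.items() if n != name]
    return reachable(u, consistent), reachable(u, without_e)

G = {"R": ("x", "y", False), "T": ("x", "w", True),
     "U": ("w", "z", True), "V": ("z", "y", False)}
reach_R, fd_R = reach_and_fd(G, "R")
```

In this hypothetical graph, fd(R) additionally contains y, which is reachable only through the inconsistent edge V, so the inclusion of Proposition 6.1 is strict.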

Definition 6.2 (Source-Disjoint). We say that two inconsistent edges R, S are source-disjoint
if u_R, u_S do not belong in the same SCC. Otherwise, they are source-joint.

Notice that in the case where u_R = u_S, both source nodes trivially belong in the same
SCC, which includes the single node u_S, and hence they are source-joint. Next, consider
two source-disjoint inconsistent edges R, S. We next define when R, S are unsplittable.
Let us say that a path P is cut by a set of nodes V if P ∩ V ≠ ∅, i.e. some node of P
belongs in the set V.

Definition 6.3 (Unsplittable). Two source-disjoint inconsistent edges R, S are unsplittable
if there exists an undirected path P_R between either endpoint of e_R to either endpoint
of e_S that is not cut by fd(R), and symmetrically a path P_S not cut by fd(S). Otherwise,
we say that the pair of edges is splittable.

Notice that P_R may connect either the sources or the destinations, i.e. it may go from
u_R to u_S, or from u_R to v_S, or from v_R to u_S, or from v_R to v_S. Observe now that we can
equivalently ask for a path P_R not cut by fd(R) that specifically is between nodes v_R
and u_S. Indeed, u_R ∈ fd(R), and hence P_R can not start from u_R. Additionally, if the
path P_R goes to v_S, we can extend it to u_S, since it can never happen that u_S ∈ fd(R)
if R, S are unsplittable.
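
The path condition of Definition 6.3 can be tested with two graph searches. The Python sketch below is an illustration under our own assumptions: a graph is a dictionary from edge names to (source, target) pairs, the function names are hypothetical, and the check for source-disjointness is deliberately left out, so the function only tests the existence of the two uncut paths.

```python
def fd(graph, name):
    """Nodes reachable from the source of edge `name` without using that edge."""
    src = graph[name][0]
    seen, stack = {src}, [src]
    while stack:
        x = stack.pop()
        for n, (a, b) in graph.items():
            if n != name and a == x and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def uncut_path_exists(graph, r, s):
    """Undirected path from an endpoint of e_r to an endpoint of e_s avoiding fd(r)."""
    cut = fd(graph, r)
    nbrs = {}
    for a, b in graph.values():
        nbrs.setdefault(a, set()).add(b)
        nbrs.setdefault(b, set()).add(a)
    targets = set(graph[s])
    seen = {x for x in graph[r] if x not in cut}
    stack = list(seen)
    while stack:
        x = stack.pop()
        if x in targets:
            return True
        for y in nbrs[x]:
            if y not in seen and y not in cut:
                seen.add(y)
                stack.append(y)
    return False

def paths_unsplittable(graph, r, s):
    return uncut_path_exists(graph, r, s) and uncut_path_exists(graph, s, r)

g1 = {"R": ("x", "y"), "S": ("z", "y")}   # R(x, y), S(z, y)
g2 = {"R": ("x", "y"), "S": ("y", "z")}   # R(x, y), S(y, z)
```

For g1 both uncut paths exist (the shared node y already connects the two edges), whereas for g2 every path from e_S to e_R passes through y, which belongs to fd(S).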

We can now state our dichotomy theorem, along with the necessary and sufficient
condition.

Theorem 6.4 (Dichotomy Theorem). Certainty(Q) is coNP-complete if there exists
a pair of source-disjoint and unsplittable edges in G[Q]. Otherwise, it is in P.

In the remainder of this section, we will prove Theorem 6.4. In Subsection 6.2, we will prove
the coNP-complete part of the theorem. In Subsection 6.3, we will complete the proof
by presenting a polynomial time algorithm.

6.2. The coNP-complete Case

In this subsection, we prove the coNP-complete part of Theorem 6.4.

Theorem 6.5. If G[Q] contains two source-disjoint and unsplittable inconsistent edges,
Certainty(Q) is coNP-complete.

We will denote the two edges as R and S. Additionally, since we want to prove 
coNP-hardness, we can assume w.l.o.g. that any edge that is not R or S is consistent. 
Hence, R, S are from now on the only two inconsistent edges in G. We will first show 
the coNP-completeness for the case where G is a DAG. Then, we will extend the proof 
for arbitrary graphs that may contain cycles. 

6.2.1. DAGs 

We initially restrict the hardness proof to the case where G is a DAG. In this case, we
can further simplify the hardness construction by transforming G such that every node
in V(G) that is not one of the endpoints of e_R, e_S (these are the nodes u_R, v_R, u_S, v_S) is
either a sink or a source node.

In particular, let v ∈ V(G) such that every incoming edge (u_1, v), ..., (u_k, v) and
outgoing edge (v, w_1), ..., (v, w_l) is consistent. Define the graph G^{T,v} that is obtained
by removing from v all outgoing edges and introducing the following new edges: (u_i, w_j),
for i = 1, ..., k and j = 1, ..., l.

Example 6.6. Consider a graph G that contains a node v with incoming edges R^c(u, v), S^c(w, v)
and outgoing edges T^c(v, z). We obtain G^{T,v} by replacing the edge e_T with two new edges:
T_u^c(u, z) and T_w^c(w, z).

Lemma 6.7. There exists a PTIME reduction from Certainty(G^{T,v}) to Certainty(G).

Proof. Consider the graph G^{T,v} with given consistent relations for the edges (u_i, v) and
(u_i, w_j). Note that for any i = 1, ..., k and any value a ∈ Dom_{G^{T,v}}(u_i), there exists
a unique value c ∈ Dom_{G^{T,v}}(v) and unique values b_j ∈ Dom_{G^{T,v}}(w_j) for all j, since
all the relations are key-consistent. Let m(a) = (b_1, ..., b_l, c) and let M be the set of
all the images m(a), i.e. M = {m(a) | a ∈ Dom_{G^{T,v}}(u_i), i = 1, ..., k}. Now, let us
construct the relations in G as follows: for edge (u_i, v), introduce the tuple (a, m(a)) for
every a ∈ Dom_{G^{T,v}}(u_i). Moreover, for edge (v, w_j) introduce a tuple that couples every
m ∈ M with the j-th coordinate of m, m[j].

To show the equivalence, assume that a_1, ..., a_k in G^{T,v} belong in the same tuple in
G^{T,v}(r), for some repair r. Then, each a_i maps to the same c ∈ Dom_{G^{T,v}}(v) and the
same b_j ∈ Dom_{G^{T,v}}(w_j). Hence, in G, m(a_1) = m(a_2) = ... = m(a_k) and all will
map to the same value for variables w_j and v in G, which implies that the same tuple
will also belong in G(r). For the other direction, if a_1, ..., a_k are in the same tuple in G,
then m(a_1) = ... = m(a_k). Hence each a_i maps to the same b_j and c in G^{T,v}. □

We can now apply Lemma 6.7 to create a graph G^T where any node not in {u_R, u_S, v_R, v_S}
is either sink or source and, if G^T is coNP-hard, so is G. Indeed, if there exists a node
v that is neither source nor sink, we can apply Lemma 6.7 to reduce this to a
hardness-equivalent graph where v is a sink node. By our construction, the transformation
reduces the number of nodes that are neither sinks nor sources by one, hence at some point we will
obtain a graph G^T that contains only sink or source nodes.

The reduction to prove the coNP-hardness will be from Monotone-3Sat, which is
a special case of 3Sat where each clause contains only positive or only negative literals.
Monotone-3Sat is known to be an NP-complete problem [B]. Given an instance of
this problem, let us denote by Φ the set of all clauses, X the set of all variables, X*
the set of all literals and c* = {T, F} (true, false). Moreover, let us define top as
⊤ = Φ × X* = {(φ, x*) | x* ∈ φ, φ ∈ Φ} and bottom, ⊥, to be a newly introduced
constant.

In order to show the reduction, we assign to each node u ∈ V(G) a label L(u),
which corresponds to one of the sets defined: ⊤, ⊥, Φ, X, X*, c*. Notice that every edge
e_T = (u_T, v_T) will now correspond to an ordered pair of labels (L(u_T), L(v_T)); this pair
will be later used to define how the relation T will be populated. We now say that a
labeling is valid if the following conditions hold:

1. For every consistent edge e_T = (u_T, v_T) the labeling (L(u_T), L(v_T)) must satisfy
one of the following conditions:

(a) L(u_T) = L(v_T).

(b) L(u_T) = ⊤ or L(v_T) = ⊥.

(c) (L(u_T), L(v_T)) is one of (Φ, c*), (X*, X), (X*, c*).

2. For the one inconsistent edge (e_R), L(u_R) = Φ and L(v_R) is one of X, X*.

3. For the other inconsistent edge (e_S), L(u_S) = X and L(v_S) is one of X*, c*.

4. There exist two paths between e_R, e_S s.t. (a) every label of the first path is one of
⊤, X, X* and (b) every label of the second path is one of ⊤, Φ, X*, c*.

It is not clear that a graph G[Q] always admits a valid labeling. However, we can show
that if G[Q] is a DAG and has a pair of source-disjoint, unsplittable edges, such a labeling
always exists.

Proposition 6.8. A graph G[Q] that is a DAG and has a pair of source-disjoint and 
unsplittable edges admits a valid labeling. 

We will prove this proposition in the rest of the subsection; before that, we will show
why a valid labeling is sufficient to prove the reduction from Monotone-3Sat. We will
first specify how the labeling dictates the way we populate each relation of our instance.
Recall that each edge e_T corresponds to an ordered pair of labels (L(u_T), L(v_T)). For
each such pair, we will define a mapping L(u_T) → L(v_T) and then populate T as follows:
T = {(a, b) | a ∈ L(u_T), b ∈ L(v_T), a → b}. Such a mapping can be inconsistent, in the
sense that some a ∈ L(u_T) may be mapped to two or more values from L(v_T). We thus
split the mappings into two categories, depending on whether they define consistent or
inconsistent relations.

As a special case, consider any pair where L(u_T) = L(v_T), for example (Φ, Φ). In
this case, the mapping is the identity mapping, i.e. every clause is mapped to itself.
Clearly, this defines a consistent relation. A consistent relation is also defined for any
pair where L(v_T) = ⊥: we map every a ∈ L(u_T) to the single constant ⊥. The mappings
are formally defined as follows:

Consistent:

• Φ → c*: a clause maps to T if it has only positive literals, else to F.

• X* → X: a literal (x+ or x−) maps to the corresponding variable (x).

• X* → c*: a literal x+ maps to T and a literal x− to F.

Inconsistent:

• X → X*: a variable (x) is mapped to both corresponding literals (x+ and x−).

• X → c*: a variable maps to both T, F.

• Φ → X (Φ → X*): a clause maps to all three variables (literals) it includes.

In addition to the three consistent mappings presented above, we should also mention
that the trivial mappings where L(u_T) = L(v_T), L(u_T) = ⊤ or L(v_T) = ⊥ also define
consistent relations. Notice that this agrees with condition (1) for the validity of a
labeling.

Example 6.9. To show an example of how the relations are generated from our labeling,
consider the formula Ψ = φ_1 ∧ φ_2, where φ_1 = (x+ ∨ y+ ∨ z+) and φ_2 = (z− ∨ w− ∨ t−).
A relation that corresponds to the pair of labels (Φ, X) will be populated by the tuples
(φ_1, x), (φ_1, y), (φ_1, z) and (φ_2, z), (φ_2, w), (φ_2, t). Observe that the relation is
inconsistent. For the pair (Φ, c*), we introduce the tuples (φ_1, T), (φ_2, F); notice that in this case
the relation is consistent.
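
The relations of Example 6.9 can be generated mechanically from the formula. The Python sketch below covers only the three mappings whose source label is Φ; the encoding of literals as strings such as "x+" and "z-", the label names "Phi", "X", "X*", "c*", and the function names are our own assumptions.

```python
def populate(clauses, pair):
    """clauses: dict clause name -> list of literals such as 'x+' or 'z-'.
    pair: ordered pair of labels; only pairs with source 'Phi' are handled."""
    def variable(lit):
        return lit[:-1]

    def truth(lit):
        return "T" if lit.endswith("+") else "F"

    if pair == ("Phi", "X"):       # clause -> each variable it mentions
        return {(c, variable(l)) for c, lits in clauses.items() for l in lits}
    if pair == ("Phi", "X*"):      # clause -> each literal it contains
        return {(c, l) for c, lits in clauses.items() for l in lits}
    if pair == ("Phi", "c*"):      # monotone clause -> polarity of its literals
        return {(c, truth(lits[0])) for c, lits in clauses.items()}
    raise ValueError("mapping not covered by this sketch")

def is_consistent(relation):
    """A relation is consistent if no key (first attribute) appears twice."""
    keys = [a for a, _ in relation]
    return len(keys) == len(set(keys))

formula = {"phi1": ["x+", "y+", "z+"], "phi2": ["z-", "w-", "t-"]}
phi_to_x = populate(formula, ("Phi", "X"))
phi_to_c = populate(formula, ("Phi", "c*"))
```

The (Φ, c*) case reads the polarity from the first literal only, which relies on the clauses being monotone.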

Thus, given a valid labeling we can create a database instance using the mappings 
we just presented. We can now show that: 

Proposition 6.10. Let I be the database that corresponds to a valid labeling according
to an instance M of Monotone-3Sat. Then, I ⊭ Q if and only if M has a satisfying
assignment.

Proof. First, note that our construction guarantees that the consistent relations will
correspond to a consistent mapping, whereas the two inconsistent relations R, S correspond
to inconsistent mappings.

Consider a satisfying assignment for M, where v(x) denotes the assigned value (true
or false) for variable x. We will construct a repair r that does not satisfy Q. Since the
assignment satisfies the formula, for every clause φ there exists a literal x* that evaluates
to true. Then, for the relation R, r includes the tuple (φ, x*) (if e_R has the Φ → X*
mapping) or (φ, x) (when Φ → X). As for relation e_S, if the mapping is X → c*, r
maps variable x to the opposite of v(x). Similarly for X → X*: if v(x) is true, the tuple
(x, x^-) is chosen, else (x, x^+).

It remains to show that Q(r) evaluates to false. For the sake of contradiction, assume
that Q(r) is true and consider a tuple t ∈ Q^f(r). Since we have a path from e_R (and in
particular node v_R) to e_S such that every label has a consistent mapping to X, for the
answer t, the values t[v_R] and t[u_S] or t[v_S] will correspond to the same variable x. Let
t[u_R] = φ and assume that it is a positive clause. Then, v(x) = T and hence t[v_S] will
be either F or x^-. But this is a contradiction, since there exists a path from v_S to
some node of R where each label has a key-consistent mapping to {T, F}.

(a) Case 1 (b) Case 2 (c) Case 3 (d) Case 4

Figure 4: Depicting the four cases for the hardness construction, along with the labeling of the four
endpoints of the inconsistent edges: (u_R, v_R) and (u_S, v_S). The full lines depict the existing edges for
each case, whereas the dotted edges depict possible edges that may or may not exist.

For the inverse direction, assume that I has a repair r such that Q(r) is false. We
construct an assignment for the variables in M as follows: if the repair r contains a tuple
(x, T) or (x, x^+), let v(x) = F; otherwise, v(x) = T. Now, consider a positive (w.l.o.g.)
clause φ of the instance M. Assume that r contains tuple (φ, x) or (φ, x^+). Using similar
arguments as before, one can see that r cannot include (x, T) or (x, x^+); otherwise, Q(r)
would evaluate to true. Hence, v(x) = T and clause φ will be satisfied. □



Thus, if there exists a valid labeling, G is coNP-hard. It remains to prove
Proposition 6.8, i.e. that such a valid labeling always exists.

Example 6.11. As a warmup example, consider the following queries:

H_1 = R^c(x, z), S(x, y), T(y, z)
H_2 = R(x, y), S(x', y)

For H_1, we assign the labels as follows: L(x) = Φ, L(y) = X and L(z) = c*. Let us check
whether our mappings are correct. The inconsistent relations S and T will correspond to
the mappings Φ → X and X → c* respectively. R^c is assigned the consistent mapping
Φ → c*. Condition (4) for validity is also trivially satisfied.

As for H_2, it suffices to label L(x) = L(y) = X* and L(z) = X.
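
For H_1 the statement of Proposition 6.10 can be checked by brute force on small formulas. The sketch below is an illustration under stated assumptions, not part of the paper: it reuses a hypothetical clause encoding (name, polarity, variables), enumerates every repair of S and T, and compares the outcome with an exhaustive satisfiability test. It is exponential and meant only for tiny inputs.

```python
from itertools import product

def h1_instance(formula):
    """Relations of H1 = R^c(x,z), S(x,y), T(y,z) under L(x)=Phi, L(y)=X, L(z)=c*."""
    variables = sorted({v for _, _, vs in formula for v in vs})
    rc = {(c, "T" if pol == "+" else "F") for c, pol, _ in formula}  # Phi -> c*
    s = {(c, v) for c, _, vs in formula for v in vs}                 # Phi -> X
    t = {(v, b) for v in variables for b in "TF"}                    # X -> c*
    return rc, s, t

def repairs(relation):
    """All repairs of a binary relation whose key is its first attribute."""
    groups = {}
    for key, val in relation:
        groups.setdefault(key, []).append(val)
    keys = sorted(groups)
    for choice in product(*(sorted(groups[k]) for k in keys)):
        yield set(zip(keys, choice))

def h1_holds(rc, s, t):
    """Evaluate the boolean query H1 on one (repaired) instance."""
    return any((x, z) in rc for x, y in s for y2, z in t if y == y2)

def certain(formula):
    """True iff every repair satisfies H1 (R^c is consistent, so only S, T vary)."""
    rc, s, t = h1_instance(formula)
    return all(h1_holds(rc, rs, rt) for rs in repairs(s) for rt in repairs(t))

def satisfiable(formula):
    """Exhaustive satisfiability test for a monotone formula."""
    variables = sorted({v for _, _, vs in formula for v in vs})
    for bits in product([True, False], repeat=len(variables)):
        val = dict(zip(variables, bits))
        if all(any(val[v] == (pol == "+") for v in vs) for _, pol, vs in formula):
            return True
    return False
```

For the formula of Example 6.9 the instance has a falsifying repair (the formula is satisfiable), whereas a formula such as (x ∨ x ∨ x) ∧ (¬x ∨ ¬x ∨ ¬x) makes every repair satisfy H_1.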

We now describe the construction of the labeling for the general case, along with
proving its validity. Every source node will always be labeled with ⊤. Further,
every sink node will be labeled with ⊥, unless otherwise specified. We distinguish several
cases for our labeling, depending on the possible configurations of direct edges between
the four nodes u_R, v_R, u_S, v_S. Note here that there can be no edge between u_R, u_S (since
u_R ∉ fd(S) and u_S ∉ fd(R)) in either direction and also that an edge and its inverse
cannot coexist, since G is acyclic. Based on these observations, we study the following
mutually exclusive cases, which cover all possible configurations:

Case 1. In the first case, there exist two directed edges: (u_S, v_R), (v_S, v_R). Note
that this configuration allows for a possible edge (v_S, u_R), but not its inverse (since then
e_R would be a consistent edge). We construct the labeling as follows: L(u_R) = X,
L(v_R) = c*, L(u_S) = Φ and L(v_S) = X*. Now, since the edges e_R, e_S are unsplittable,
there must be a path v_S ↔ u_R (and not to v_R) that is not cut by u_S (labeled Φ): label
every sink node on this path by X.

Figure 4a depicts the labeling. We next show that it is valid. For this, it suffices to
check that a possible edge (u_T, v_T), where v_T is a sink node labeled with X, corresponds
to a key-consistent mapping. In this case, u_T can not be labeled as Φ, since u_S does not
cut the path. It cannot be labeled as c* as well, because then u_S would again cut the
path (since u_S has an edge to v_R). Hence, the only possible labels are X, X*, ⊤, and in
all cases the mapping to X is key-consistent.

Case 2. In the second case, the configuration includes the edge (u_S, v_R), but not
(v_S, v_R) or (v_R, v_S). Construct the labeling as follows: L(u_R) = L(v_S) = X, L(v_R) = c*
and L(u_S) = Φ. As in Case 1, since e_R, e_S are unsplittable, there exists a path v_S ↔ u_R
that is not cut by u_S (labeled Φ): as before, label each sink node in this path by X (notice
that c* will not cut this path as well).

Figure 4b shows that our labeling is valid, even in the case that we have an edge
(u_R, v_S) or its inverse (since both endpoints are labeled by X).

Case 3. In the third case, the configuration does not contain any of (u_R, v_S), (u_S, v_R),
but it contains the edge (v_R, u_S). In this case, the edge (v_S, u_R) can not be part of the
configuration, since it creates a cycle. The only other possible edge is (v_R, v_S) (see
Figure 4c). The labeling is constructed as follows: L(u_R) = Φ, L(v_R) = L(v_S) = X* and
L(u_S) = X. The unsplittability of e_R, e_S guarantees a path from v_S to either u_R or v_R
that is not cut by u_S (labeled X): label every sink node in this path as c*.

In order to show the validity of our construction, it suffices to show that any edge
(u_T, v_T) where v_T is a sink node labeled with c* is key-consistent. Indeed, notice that
u_T can not be labeled with X, hence u_T will be labeled as ⊤, Φ or X*. In any of these
cases, the mapping is key-consistent.

Case 4. The final case covers all the remaining configurations: any such configuration
will contain no cross edges between sources u_R, u_S and targets v_R, v_S. Thus, the only
possible edges are (v_R, v_S) or (v_S, v_R) (see Figure 4d).

In this case, the labeling is L(u_R) = Φ, L(v_R) = L(v_S) = X* and L(u_S) = X.
Moreover, the unsplittability condition guarantees that we have two paths: a path p_R
from v_R to any of the endpoints of S that is not cut by u_R, and a path p_S from v_S to
any of the endpoints of R that is not cut by u_S. We now distinguish the following cases
for a sink node v in p_R ∪ p_S:

• v ∈ p_R \ p_S: Label with L(v) = X. The labeling for v is valid, since every node
that has an edge to v can not be labeled with Φ.

• v ∈ p_S \ p_R: Label with L(v) = c*. The labeling for v is valid, since every node
that has an edge to v can not be labeled with X.

• v ∈ p_R ∩ p_S: Label with L(v) = X*. Again, the labeling is valid, since any node
with an edge to v will not be labeled with X or Φ.
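
The endpoint labelings of the four cases can be sanity-checked against the mapping table given earlier (consistent: Φ → c*, X* → X, X* → c*, plus the trivial mappings; inconsistent: X → X*, X → c*, Φ → X, Φ → X*). The sketch below is only such a check under assumed names: it verifies that e_R and e_S receive inconsistent mappings and that the direct edges named in each case receive consistent ones. It does not verify the labels of the remaining nodes on the paths.

```python
PHI, X, XS, CS, TOP, BOT = "Phi", "X", "X*", "c*", "top", "bot"

CONSISTENT = {(PHI, CS), (XS, X), (XS, CS)}
INCONSISTENT = {(X, XS), (X, CS), (PHI, X), (PHI, XS)}

def classify(src, dst):
    """Return 'consistent', 'inconsistent', or None when no mapping is defined."""
    # Trivial mappings: equal labels, a source labeled top, or a target labeled bottom.
    if src == dst or src == TOP or dst == BOT:
        return "consistent"
    if (src, dst) in CONSISTENT:
        return "consistent"
    if (src, dst) in INCONSISTENT:
        return "inconsistent"
    return None

# For each case: labels of the four endpoints, then the direct edges
# (existing or possible) that the case description mentions.
CASES = {
    1: ({"uR": X, "vR": CS, "uS": PHI, "vS": XS},
        [("uS", "vR"), ("vS", "vR"), ("vS", "uR")]),
    2: ({"uR": X, "vR": CS, "uS": PHI, "vS": X},
        [("uS", "vR"), ("uR", "vS"), ("vS", "uR")]),
    3: ({"uR": PHI, "vR": XS, "uS": X, "vS": XS},
        [("vR", "uS"), ("vR", "vS")]),
    4: ({"uR": PHI, "vR": XS, "uS": X, "vS": XS},
        [("vR", "vS"), ("vS", "vR")]),
}

def check_case(labels, extra_edges):
    """e_R and e_S must be inconsistent; every other listed edge must be consistent."""
    ok = classify(labels["uR"], labels["vR"]) == "inconsistent"
    ok = ok and classify(labels["uS"], labels["vS"]) == "inconsistent"
    return ok and all(
        classify(labels[a], labels[b]) == "consistent" for a, b in extra_edges
    )
```

Running `check_case(*CASES[k])` for k = 1, ..., 4 confirms that each endpoint labeling agrees with the mapping table.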



6.2.2. Arbitrary Graphs 

We now leverage the hardness construction for DAGs to extend it to the case where
G[Q] is an arbitrary graph. In order to achieve this, we transform G to a DAG G_D
by contracting each strongly connected component (SCC) either to a single node or to
an edge. The contraction must be such that (a) G_D admits a labeling that makes it
coNP-complete and (b) the labeling of G_D can always be extended to a valid labeling of
G. In order to construct G_D, we process each SCC A depending on whether it contains
only consistent edges or not.

A is consistent: Contract the SCC A to a single node v_A. If L_{G_D}(v_A) = ℓ, assign
the label ℓ to all the nodes of A in G.

A contains R or S: Assume w.l.o.g. that A contains R. Contract A to the edge
(u_R, v_R) such that:

• Every node in V(A) that is reachable from u_R in A \ {e_R} is mapped to u_R.

• Every other node in V(A) is mapped to v_R.

Notice that, since we have no directed path from u_R to v_R that does not use e_R, the
edge (u_R, v_R) will never be reduced to a single node. Given a labeling for G_D, let
L_G(v) = L_{G_D}(u_R), if v is mapped to u_R, else L_G(v) = L_{G_D}(v_R). We next show that the
construction of G_D is well-behaved.
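
The contraction of an SCC that contains R amounts to one reachability computation. The following sketch illustrates it under assumptions: edges are given as ordered pairs (so at most one edge per ordered pair of nodes), and the function names are hypothetical.

```python
from collections import deque

def contract_scc_with_edge(scc_nodes, edges, e_r):
    """Map each node of an SCC A that contains e_R = (u_R, v_R) to 'uR' or 'vR'."""
    u_r, _ = e_r
    # Adjacency of A \ {e_R}, restricted to the nodes of the SCC.
    adj = {n: [] for n in scc_nodes}
    for a, b in edges:
        if (a, b) != e_r and a in adj and b in adj:
            adj[a].append(b)
    # Breadth-first search from u_R without using e_R.
    reached, queue = {u_r}, deque([u_r])
    while queue:
        n = queue.popleft()
        for m in adj[n]:
            if m not in reached:
                reached.add(m)
                queue.append(m)
    return {n: ("uR" if n in reached else "vR") for n in scc_nodes}

def extend_labeling(mapping, label_u_r, label_v_r):
    """Extend the labels of u_R and v_R in G_D to all nodes of the SCC in G."""
    return {n: (label_u_r if side == "uR" else label_v_r) for n, side in mapping.items()}
```

For an SCC with edges (u, v) = e_R, (v, a), (a, u), (u, b), (b, u), the nodes u and b are mapped to u_R, while v and a are mapped to v_R.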

Lemma 6.12. G_D satisfies the following properties.

1. G_D is a DAG.

2. R, S are source-disjoint and unsplittable in G_D.

Proof. It is straightforward to see that property (1) holds for G_D. To show (2), we first
observe that e_R, e_S remain source-disjoint: indeed, since G_D is a DAG, not being
source-disjoint would imply that u_R = u_S, a contradiction, since e_R, e_S are source-disjoint in G.
In order to show that e_R, e_S are unsplittable, consider the path P that connects R with
S in G and is not cut by fd(R). In particular, the path will connect node v_R with either
endpoint of e_S. Now, consider the corresponding path P_D in G_D. P_D still connects v_R
with either endpoint of e_S. For the sake of contradiction, assume that P_D is cut in G_D by
some node v ∈ fd(R). Then, consider the set of nodes m(v) from V(G) that are mapped
to v during the contraction of G. It is easy to see that, by our construction, m(v) belongs
in a single SCC. Since the nodes mapped to u_R are all the nodes in V(A) reachable from u_R
not through e_R, this implies that in G we would have a directed path from u_R that cuts
P, a contradiction. □

We must also investigate whether extending the labeling of G_D to G as we have
described is valid. We will see that this adds some additional constraints to how we can
label G_D.

Lemma 6.13. Let R belong to the SCC A_R of graph G. The labeling of G_D can be
extended to G if the mapping L_{G_D}(v_R) → L_{G_D}(u_R) is consistent.

Proof. The labeling for a SCC A that contains only consistent edges is trivially valid,
since every node of A will be assigned the same label ℓ and the mappings within A will be
the identity mapping ℓ → ℓ. Next, consider A_R and recall that e_R is the only inconsistent
edge in A_R: we must show that all the other edges in A_R have a consistent mapping.
Indeed, consider any consistent edge e = (u, v) in A_R. If u is mapped to u_R, then so
must be v: then, both u, v will be assigned the label L_{G_D}(u_R), making it a consistent
mapping. Suppose that u is mapped to v_R: then, L_G(u) = L_{G_D}(v_R) and hence the
mapping for e will be either L_{G_D}(v_R) → L_{G_D}(v_R) or L_{G_D}(v_R) → L_{G_D}(u_R), and both
are consistent. □
Let us take a step back now and see the bigger picture. Clearly, if R, S do not belong in
any SCC, we are done, since the hardness construction for G_D transfers unconditionally
to G. However, if either of R, S is part of a SCC, we must make sure that the mapping
L_{G_D}(v_R) → L_{G_D}(u_R) (and similarly for S) is consistent. For this, we first show a simple
lemma for the structural properties of G_D in this case.

Lemma 6.14. Let R be part of an SCC A_R. In G_D, there is no directed path u_S → v_R.

Proof. Indeed, if such a path existed, we would have a directed path from u_S to A_R in
G, which would imply that u_R, v_R ∈ fd(S), a contradiction to the fact that there exists
a path between S and R that is not cut by fd(S). □

We have discussed the case where both R and S do not belong in any SCC. We next
present the other two cases, when R belongs in SCC A_R and S does not, and when both
R, S belong in the SCCs A_R, A_S respectively.



2 SCCs: According to Lemma 6.14, configurations (1) and (2) are not allowed for
G_D. Hence, the only possible cases are (3) and (4). Notice that the mapping X* → X is
consistent for the edge e_S, whereas X* → Φ is not. In order to overcome this problem,
we simply use instead of X* the label T = Φ × X* for the node v_R. Indeed, Φ × X* → Φ
defines a consistent mapping, whereas the validity conditions for the labeling remain
intact.

1 SCC: Assume that R belongs in the SCC A_R. For this case, note that all 4
configurations are allowed. We treat (3) and (4) exactly as in the previous case. For
cases (1) and (2), notice that since by Lemma 6.14 u_S does not have any directed path
to v_R, only the rightmost edge of Figure 4a and Figure 4b can correspond to e_R for the
graph G_D. As before, the labels X* (or X) are not valid, so we have to use again
Φ × X* instead of X* and X, which will retain the validity of the labeling.

6.3. The PTIME algorithm 

In this subsection, we prove in detail the reverse direction of Theorem 6.4.

Theorem 6.15. If every source-disjoint pair of inconsistent edges in G[Q] is splittable,
Certainty(Q) is in P.

We prove the theorem in a sequence of steps. We start by defining the graph G^s.

6.3.1. The graph G^s

We first partition the inconsistent edges of G into equivalence classes: two inconsistent
edges belong in the same class if they are source-joint. Observe that every equivalence
class C corresponds to a SCC that contains all the source nodes of the edges in C, which
we call scc(C).

Next, we create a graph G^s as follows. For each equivalence class, we introduce a
new node in G^s. We also introduce an edge (C_1, C_2) if there exists an edge e_S in class
C_2 such that u_S ∈ Reach(C_1) (or equivalently if there exists a node v in scc(C_2) such
that v ∈ Reach(C_1)).
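
Given the equivalence classes and the sets Reach(C), building G^s is a direct transcription of the rule above. The sketch below assumes that both are supplied as inputs (Reach is defined earlier in the paper and is not recomputed here), excludes self-loops by assumption, and uses hypothetical function names.

```python
def build_gs(sources, reach):
    """Edges of G^s.

    sources[C]: source nodes of the inconsistent edges in class C.
    reach[C]:   the set Reach(C), assumed to be precomputed.
    An edge (C1, C2) is added when some source node of C2 lies in Reach(C1).
    """
    return {
        (c1, c2)
        for c1 in sources
        for c2 in sources
        if c1 != c2 and sources[c2] & reach[c1]
    }

def sinks(nodes, edges):
    """Nodes of G^s with no outgoing edge."""
    return {c for c in nodes if not any(a == c for a, _ in edges)}

def is_transitively_closed(edges):
    """Check that (a, b) and (b, d) being edges implies that (a, d) is an edge."""
    return all(
        (a, d) in edges for a, b in edges for c, d in edges if b == c and a != d
    )
```

With three classes whose sources are x, y, z and Reach sets {y, z}, {z}, ∅ respectively, the resulting graph has edges (C1, C2), (C1, C3), (C2, C3), a single sink C3, and is transitively closed.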

Lemma 6.16. G^s satisfies the following properties:

1. It is a DAG.

2. It is transitively closed: if (C_1, C_2) and (C_2, C_3) are edges, so is (C_1, C_3).

Proof. If there was a cycle in G^s, the edges that belong to the nodes of the cycle would
belong in a single equivalence class, a contradiction. This proves (1). For (2), there must
be an edge e_S in C_2 such that for every e_R in C_1, u_S ∈ Reach(R). Similarly, there
must be an edge e_T in C_3 such that for every edge in C_2 the same holds: in particular,
it must be that u_T ∈ Reach(S). However, this implies that for every edge e_R in C_1,
u_R → u_S → u_T, hence u_T ∈ Reach(R). Consequently, (C_1, C_3) is an edge in G^s. □

Recall that a node is a sink node if it has no outgoing edges. Since G^s is acyclic, it
will contain at least one sink node. Additionally, since G^s is transitively closed, if the
sink node C is unique, every node C' ∈ V(G^s) has an edge (C', C) to C; this fact will be
of use later in the section.

6.3.2. Split Sets 

In this part of the section, we define split sets and prove several properties that will
be useful for the remainder of the section.

Definition 6.17 (Split Set). For a node C ∈ V(G^s):

split(C) = {C' ∈ V(G^s) \ {C} | every path between C, C' is cut by Reach(C)}

Consider an equivalence class of inconsistent edges C, which corresponds to a node
in G^s, and let v ∈ V(G). For v, we define C_C(v) to be the set of all inconsistent edges
e_R ∈ C, such that every path u_R ⇝ v goes through e_R. Note that ∅ ⊆ C_C(v) ⊆ C.
Moreover, notice that C_C(v) = ∅ implies that every source node u_R of an edge in C
can reach v through a directed path that does not go through e_R. Hence C_C(v) = ∅ is
equivalent to v ∈ fd(C).
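
The set C_C(v) can be computed with one reachability test per edge of the class: e_R belongs to C_C(v) exactly when v is unreachable from u_R once e_R is removed. The sketch below illustrates this under assumptions: edges are ordered pairs (at most one edge per ordered pair of nodes), and the names `reachable` and `c_set` are hypothetical.

```python
from collections import deque

def reachable(edges, start, removed_edge=None):
    """Nodes reachable from `start` by directed paths that avoid `removed_edge`."""
    seen, queue = {start}, deque([start])
    while queue:
        n = queue.popleft()
        for a, b in edges:
            if a == n and (a, b) != removed_edge and b not in seen:
                seen.add(b)
                queue.append(b)
    return seen

def c_set(edges, cls, v):
    """Edges e_R = (u_R, v_R) of the class `cls` such that every directed path
    from u_R to v goes through e_R (vacuously true if u_R cannot reach v)."""
    return {e for e in cls if v not in reachable(edges, e[0], removed_edge=e)}
```

With edges (u, v), (v, w), (u, a), (a, w) and the class {(u, v)}, the node v can only be reached through (u, v), while w is also reachable via a, so the set is empty for w.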

Proposition 6.18. For some i = 1, 2, every path from an edge e_R ∈ C_1 to an edge
e_S ∈ C_2 contains a node v such that v ∈ fd(C_i).

Proof. Suppose the proposition does not hold, and fix i = 1. Then, we can find edges
e_R ∈ C_1, e_S ∈ C_2 such that there exists a path P from v_R to either of the endpoints of
e_S, where for any node v ∈ P: C_{C_1}(v) ≠ ∅. We will find an edge e_T ∈ C_1 such that it
has a path to e_S (and in particular u_S) not cut by fd(T).

Let v_1, ..., v_m = u_S be the nodes of the path P. Now, consider the last index j such
that C_{C_1}(v_j) ≠ C_1. If there is no such j, then no node in P belongs to fd(R)
and hence T = R is our candidate. Otherwise, for this j, ∅ ⊊ C_{C_1}(v_j) ⊊ C_1. Since
there exists an edge e_U ∈ C_1 \ C_{C_1}(v_j), there exists a path u_U ⇝ v_j. Moreover, since C_1
contains non source-disjoint edges, for every edge e_T ∈ C_1, u_T ⇝ u_U ⇝ v_j. However,
there exists an edge e_T ∈ C_{C_1}(v_j), which implies that u_T can reach v_j through a path P_j
that goes through e_T. Consequently, we have a path v_T ⇝ v_j that is not cut by fd(T).
Since all the nodes v_{j+1}, ..., v_m have C_{C_1}(v_i) = C_1, this implies that there exists a path
from e_T to u_S not cut by fd(T).

In fact, we can show an even stronger statement. Consider any other edge e_U ∈ C_2.
By the structure of C_2, we have that u_U ⇝ u_S. Moreover, the path u_U ⇝ u_S cannot be
cut by fd(T), since then u_S ∈ fd(T), a contradiction. This implies that, for any edge
e_U ∈ C_2, there exists a path from e_T to e_U not cut by fd(T).

Similarly for i = 2, we can find an edge e_S ∈ C_2 such that for any edge e_R ∈ C_1,
we have a path between them not cut by fd(S). However, this implies that e_T, e_S are
unsplittable, which is a contradiction. □



Proposition 6.19. Let v be a node in V(G) \ V(A). If C_C(v) = ∅, every directed path
from A = scc(C) to v is cut by Reach(C).

Proof. Assume a directed path from w ∈ V(A) to v, such that w is the only node of the path
in V(A).

Let u be the first node in the path w ⇝ v such that C_C(u) = ∅. We will prove that
u ∈ Reach(A). We will first establish that there exists a consistent path w → u. If v = w,
this is a trivial statement. Otherwise, consider the predecessor of u in the path, denoted
by u_p. By our assumption, there exists some e_S ∈ C_C(u_p). However, since e_S ∉ C_C(u),
there exists some path P : u_S ⇝ u that does not go through e_S. Notice that P overlaps
with the path w → u only on u; otherwise, e_S ∉ C_C(u_p), a contradiction. Since there
exists a path from w to u_S, we have created two edge-disjoint paths from w to u: thus,
there must be a consistent edge (w, v).

We will show that u ∈ Reach(C); in particular, we will show inductively that for
each v ∈ V(A), v → u. Showing that w → v is the basis of the induction. Consider a set
of nodes V_i ⊆ V(A) such that for every node in V_i, the node has a consistent path to u.
Next, let e_S be an edge such that u_S ∈ V(A) \ V_i and v_S ∈ V(A): such an edge always
exists. By our inductive hypothesis, v_S → u. If e_S is consistent, u_S → u
as well. Otherwise, since e_S ∉ C_C(u), there exists a path u_S ⇝ u not through e_S. This
implies that u_S has two disjoint paths to some node in the path v_S → u; hence, u_S → u.
We can now create V_{i+1} = V_i ∪ {u_S} and this concludes the inductive step. □

We are now in position to prove several propositions about split sets. 

Proposition 6.20. Let C_1, C_2 be sink nodes in G^s. Then, either C_1 ∈ split(C_2) or
C_2 ∈ split(C_1).

Proof. Following from Proposition 6.18, we can assume w.l.o.g. that every path from
C_1 to C_2 is cut by some node v ∈ fd(C_1). Thus, C_{C_1}(v) = ∅. Next, we apply
Proposition 6.19: since there is a directed path P_v from C_1 to v, there exists some node w in P_v
such that w ∈ Reach(C_1). However, notice that there can be no path w → u_R, where e_R
is an inconsistent edge. Indeed, if that was the case, C_1 would have an edge to the node
in G^s that corresponds to the equivalence class of e_R, a contradiction. Consequently, we
must have that the path w ⇝ v is consistent: w → v. This immediately implies that
v ∈ Reach(C_1) and concludes the proof. □

Lemma 6.21. Let C_0, C_1, C_2 ∈ V(G^s) such that C_0, C_1 are sink nodes. If C_2 ∈ split(C_0)
and C_1 ∉ split(C_0), then C_2 ∈ split(C_1).

Proof. For the sake of contradiction, assume that C_2 ∉ split(C_1). Then there exists a
path P_12 : v_{R_1} ↔ u_{R_2} not cut by Reach(C_1), where R_1 ∈ C_1 and R_2 ∈ C_2. Additionally,
since C_1 ∉ split(C_0), there exists another path P_01 : u_R ↔ u_{R'_1} not cut by Reach(C_0),
where R ∈ C_0 and R'_1 ∈ C_1.

Now, notice that u_{R'_1} and v_{R_1} are connected through a path P_1 inside the SCC defined
by C_1; moreover, no node in P_1 can be cut by Reach(C_0), since then we would have an
edge (C_0, C_1) and C_0 would not be a sink node. Hence, since the path P_01, P_1, P_12
connects u_R to u_{R_2}, and C_2 ∈ split(C_0), it must be that P_12 is cut somewhere by
Reach(C_0), and let v be this node.

Let P_v : u_R → v (such a path exists for every edge of C_0). Then, the path P_12
until node v plus P_v defines a path from v_{R_1} to u_R that is not cut by Reach(C_1).
Indeed, Reach(C_1) can not cut P_v, since then v ∈ Reach(C_1), which would contradict our
assumption for the path P_12. The existence of such a path implies that C_0 ∉ split(C_1).

However, this is a contradiction, since C_0, C_1 are sinks and hence it must be that
C_0 ∈ split(C_1) by the previous proposition. □

The above lemma allows us to show the following proposition about sink nodes. 

Proposition 6.22. Let C_0, C_1 be sink nodes in G^s. If C_1 ∉ split(C_0), split(C_0) ⊂
split(C_1).

Proof. We first show that split(C_0) ⊆ split(C_1). Let C_2 ∈ split(C_0). By the above
lemma, C_2 ∈ split(C_1) as well. Additionally, we have that C_0 ∉ split(C_0) by definition.
However, since C_1 ∉ split(C_0) and both C_0, C_1 are sink nodes, it must be that C_0 ∈
split(C_1), so the inclusion is strict. □

6.4. Isolated Sets

In this section, we define an isolated set of edges, which is intuitively a set of edges
that "cuts" itself from the rest of the graph. Isolated sets are a crucial component of
our PTIME algorithm, since they allow us to reduce the problem recursively into smaller
instances.

Definition 6.23 (Isolated Set). A set of inconsistent edges E^iso is isolated if there exists
a set of nodes V^iso such that (a) if R ∈ E^iso and v ∈ V^iso, v ∈ Reach(R) and (b) if an
inconsistent edge S ∉ E^iso, any path that connects some R ∈ E^iso with S is cut by V^iso.
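
The two conditions of the definition can be tested directly on a small graph. The sketch below is an illustration under explicit assumptions: Reach(R) is supplied as input, paths in condition (b) are read as undirected, a path is "cut" when it contains a node of V^iso, and the function name is hypothetical.

```python
from collections import deque

def is_isolated(edges, inconsistent, e_iso, v_iso, reach):
    """Check conditions (a) and (b) of an isolated set.

    edges:        all directed edges (u, v) of the graph.
    inconsistent: the inconsistent edges among them.
    reach[R]:     the set Reach(R), assumed to be given.
    """
    # (a) every node of V_iso lies in Reach(R), for every R in E_iso.
    if any(v not in reach[r] for r in e_iso for v in v_iso):
        return False
    # (b) explore the undirected graph with V_iso removed, starting from the
    # endpoints of the edges in E_iso.
    start = {n for r in e_iso for n in r} - set(v_iso)
    seen, queue = set(start), deque(start)
    while queue:
        n = queue.popleft()
        for a, b in edges:
            for x, y in ((a, b), (b, a)):
                if x == n and y not in seen and y not in v_iso:
                    seen.add(y)
                    queue.append(y)
    # No endpoint of an outside inconsistent edge may be reached uncut.
    outside = [s for s in inconsistent if s not in e_iso]
    return not any(n in seen for s in outside for n in s)
```

For instance, with inconsistent edges R = (a, b) and S = (c, d) that both point into a common node w via consistent edges, the set {R} is isolated by V^iso = {w} provided w ∈ Reach(R), and it is not isolated by the empty set.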

Theorem 6.24 (Existence of Isolated Sets). If G^s has at least two sink nodes, there
exists an isolated set E^iso.



Proof. By Proposition 6.20, there exist sink nodes C_1, C_2 such that C_1 ∈ split(C_2).
Among all such sink nodes C_2, take C to be the one such that |split(C)| is maximum.
Clearly, split(C) is non-empty.

First, let us assume that there exists some C'' ∉ split(C) such that C'' ≠ C and
(C'', C) ∉ E(G^s). Consider a sink node C_T such that (C'', C_T); there will always be such
a node C_T in G^s or C_T = C'' will be a sink node. It is easy to see that C_T ∉ split(C)
in either case. Applying Proposition 6.22, we get that split(C_T) ⊃ split(C). Since also
C ∈ split(C_T), C_T contradicts the fact that C is chosen such that the size of split(C) is
maximum.

Hence, for every C'' ∉ split(C), either C'' = C or the edge (C'', C) exists. In this case,
we claim that all the edges in V(G^s) \ split(C) are an isolated set, where V^iso = Reach(C).
Notice that condition (a) for the isolated set is satisfied in a straightforward way. As for
(b), consider some C'' and assume, for the sake of contradiction, that there exists a path
P from C'' to C' ∈ split(C) not cut by Reach(C''). Notice that, since C'' = C or (C'', C),
this implies that P is not cut by Reach(C) as well. But then we would have a path from
C to C' not cut by Reach(C), which contradicts the fact that C' ∈ split(C). □
6.5. The Recursive Algorithm

The algorithm that we present here uses the notion of an isolated set to split the 
problem into smaller instances, which can be solved recursively. We next present the 
algorithm in detail. 

Given a graph G, we distinguish two cases, depending on whether G s contains a single 
sink node or not. 



6.5.1. Single Sink Node 

In this case, following from Lemma 6.16, every node in G^s has an edge towards the
sink node C. As we have previously discussed, every source node in C belongs in a single
SCC, which we denote S_C = scc(C). In Section 5 we have shown that one can represent
the set of all possible answers to the full query S_C as a set of value-disjoint or-sets of
polynomial size, 𝒜_{S_C}.

Let a_1, a_2 be two tuples in 𝒜_{S_C}(I) that belong in different or-sets. Consider now
any inconsistent edge e_R of G and let b ∈ Dom_G(u_R). The crucial observation is that
the choice of a repair for any key-group of b can influence at most one of a_1, a_2. Notice
that these tuples are value-disjoint. Additionally, since every edge will have a directed
consistent path to some node v ∈ S_C, there is a path u_R → v. But that would mean
that b is mapped to both a_1[v] ≠ a_2[v] through this path, a contradiction.

Now, for an or-set A ∈ 𝒜_{S_C}(I), define as I[A] the subinstance of I that contains for
the inconsistent edges only the key-groups where the key maps to some value in A. By
the discussion above, we have:

Proposition 6.25. I ⊨ G iff there exists an or-set A ∈ 𝒜_{S_C}(I) such that I[A] ⊨ G.

It seems on the surface that we have not simplified the complexity of the problem.
However, observe that the answers in an or-set A are mutually exclusive, i.e. every
frugal repair contains at most one. Similarly now, we can define for some a ∈ A, the
subinstance I[a] that contains the key-groups of the edges mapped to a.

Proposition 6.26. I[A] ⊨ G iff for every a ∈ A: I[a] ⊨ G.

Proof. For the one direction, assume that there exists a ∈ A such that I[a] ⊭ G. Since A is
an or-set, this implies immediately that I[A] ⊭ G. The other direction is straightforward
to show. □
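
Taken together, Propositions 6.25 and 6.26 reduce the test I ⊨ G to an exists/for-all combination over the or-sets. A minimal sketch of that combination follows; the oracle `certain_for_answer`, which stands for a recursive call deciding I[a] ⊨ G, is hypothetical, and or-sets are assumed to be non-empty.

```python
def certain_via_or_sets(or_sets, certain_for_answer):
    """Decide I |= G from the or-sets of answers.

    or_sets:            iterable of or-sets, each an iterable of answers a.
    certain_for_answer: hypothetical oracle returning True iff I[a] |= G.
    Proposition 6.25 gives the outer 'exists'; Proposition 6.26 the inner 'for all'.
    """
    return any(all(certain_for_answer(a) for a in A) for A in or_sets)
```

For example, with or-sets {a1, a2} and {b1}, the instance is certain as soon as I[b1] ⊨ G holds, even if I[a2] ⊭ G.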

So now, let us focus on the instance I[a]. Observe that for this instance, every edge
that belongs in C has exactly one key-group. Hence, we can create a polynomially large
number of instances for the choices of C. Each instance generated now will have the
property that the edges in C are consistent. We have thus reduced the problem to one
where the number of nodes of G^s is one less. The base of the recursion has an instance
where every edge is consistent, where I ⊨ Q iff Q(I) evaluates to true.

6.5.2. Two Sink Nodes 



In this case, we can apply Theorem 6.24 to find a set of inconsistent edges E^iso that
are isolated by a set of nodes V^iso.

Let V^iso = {w_1, ..., w_k} and consider a possible evaluation of the tuple (w_1, ..., w_k)
to some a = (a_1, ..., a_k). Since every source node of an edge in E^iso has a directed
consistent path to every node in V^iso, for each R ∈ E^iso, there exists a set of values
m_R(a) ⊆ Dom_G(u_R) such that every such value maps to the tuple a. The crucial
observation is that, for a ≠ a', we must have that m_R(a) ∩ m_R(a') = ∅.

Let G^iso be the induced graph that contains all nodes in V^iso, along with the nodes
of the isolated edges and any node that, after having removed V^iso, is still connected to
some edge in E^iso. Consider a new instance of Certainty, where G^iso is our graph
and for each edge R ∈ E^iso, the relation is limited only to key-groups from m_R(a). Let
I[a] be this sub-instance of the database.

Define by r[a] the set of frugal repairs that contain a as an answer. Then, if I[a] ⊭
G^iso, it must be that r[a] = ∅. On the other hand, assume that I[a] ⊨ G^iso, i.e. every
repair is satisfying. Then, whether I ⊨ Q or not will be independent of the choices in
I[a].

This implies the following recursive algorithm. For each such a, construct and
recursively solve the instance I[a] for the graph G^iso. Notice that, according to our
previous lemma, G^iso will contain at least one less equivalence class than G. Let
A^iso = {a | I[a] ⊨ G^iso}. Then, replace in G the subgraph G^iso by an arbitrary new
node n^1 that has edges to all nodes in V^iso. Moreover, for w_i ∈ V^iso the relation
corresponding to the edge S_i(n^1, w_i) will contain a tuple S_i(a, a_i) for every a ∈ A^iso. Let
G' be the newly constructed instance. Notice that the new instance G' contains at least
one less node for G^s, hence it is an easier subproblem.

Example 6.27. Continuing our example on G[Q_a], G^iso will be the graph G[Q_a] without
the edge R_1(x, y). We can solve any instance on G^iso in a straightforward way, since
we can make a choice for each value of u independently. Then, G' will be the graph:
R_1(x, y), S_y(n^1, y). Since S_y is a consistent edge, G' contains only one inconsistent
edge and can also be solved in the same straightforward way. Thus, we can compute
Certainty(Q_a) in PTIME.

In order to show that the algorithm is indeed valid, it remains to show that both
G^iso, G' can still be solved in PTIME, i.e. they still contain no source-disjoint edges that
are unsplittable.

Lemma 6.28. G^iso and G' do not contain any source-disjoint and unsplittable pair of
edges.

Proof. For both G^iso, G', it suffices to show that any path P in either graph that connects
two nodes C, C' and is cut by Reach(C) in G^s is still cut by Reach(C).

First, we consider G^iso. The only violation may happen when P is cut by Reach(C)
outside of G^iso. But then, P must have a node in V^iso, which means that Reach(C) will
still cut it inside G^iso.

Finally, we consider the case of G'. Again, the only problem is when P is cut by
Reach(C) outside G'. Assume that the cut happens at some node v. Then, for every
edge e_R ∈ C, we have a path P_R : u_R → v. But then, P_R will meet some node u ∈ V^iso.
However, this implies that v ∈ V^iso as well and P will be cut by Reach(C) inside G' as
well. □

This proves our main theorem. As a side note, observe that if the graph is a DAG, we 
do not need to compute any or-sets and every other operation related to the algorithm 
is FO-expressible. Hence: 
Theorem 6.29. If G[Q] is a DAG and contains no pair of source-disjoint and unsplittable
edges, Certainty(Q) is FO-expressible.

7. Trees 

When G[Q] is a tree (i.e. contains no undirected cycles), the corresponding query Q 
is acyclic. The dichotomy condition then becomes as follows. 

Corollary 7.1. If G[Q] is a tree, Certainty(Q) is coNP-complete if there exist two
inconsistent edges R, S such that the unique path v_R ↔ v_S does not contain R or S;
otherwise, it is FO-expressible.

For this case, [TT] proves a dichotomy into queries that allow first-order rewriting and
queries that do not. Recall that our algorithms for DAGs are always FO-expressible; thus,
our condition for coNP-hardness must coincide with the non FO-expressibility condition
presented by Wijsen for the case where every edge is inconsistent. We show next that
this is indeed the case.

To draw the comparison with [TT], note that we have to restrict our setting such
that every edge is inconsistent. In this case, we show in Section 7 that indeed the two
conditions are the same.

Given the acyclic query Q, construct the join tree τ by introducing a node for each
atom and an edge labeled with x whenever two atoms share variable x (since the atoms
are binary, they can share at most one variable): the resulting graph τ will be a tree.
Wijsen then defines an additional graph, called the attack graph, which again contains
the atoms as nodes. In order to add the edges, consider an atom R in Q: for the variable
u_R (i.e. the key variable in R), let us define the set R^{+,Q} as the nodes reachable from
u_R in G[Q] \ {e_R}. Then, we add a directed edge (R, S) to the attack graph if and only
if no label on the unique path that connects R and S in τ is contained in R^{+,Q}.

We can now describe the criterion for the dichotomy on first-order rewriting: if the 
attack graph is cyclic, Certainty(Q) is not FO-expressible; otherwise, it is. We can 
now prove the following lemma. 

Lemma 7.2. If G[Q] contains two edges R, S such that the unique path v_R ↔ v_S does
not contain R or S, the attack graph contains a cycle. Otherwise, the attack graph is
acyclic.

Proof. Let L be the set of labels that appear in the unique path from R to S in the
join tree. We will show that R^{+,Q}, S^{+,Q} have an empty intersection with L. This
implies that there will be an edge from R to S and vice versa in the attack graph, hence
proving the claim. Indeed, consider the set R^{+,Q}, which contains the nodes reachable
from u_R in G[Q] \ {e_R}. However, since e_R is cut, u_R can not reach any of the nodes in
L.

For the second direction, following [TT], it suffices to show that there is no cycle
of length 2. Consider any two edges R, S. Then, the path from R to S in the join
tree contains at least one label from u_R, u_S; let it be w.l.o.g. u_R. However, note that
u_R ∈ R^{+,Q}. This implies that there is no edge (R, S) in the attack graph, hence there
will be no cycle of length 2. □
The lemma, in combination with our previous discussion, implies that, for the case
where G[Q] is a tree and every edge is inconsistent, the two conditions for the dichotomy
coincide.

8. Conclusion 

In this paper, we make significant progress towards proving a dichotomy for the
complexity of Certainty(Q), where Q is a conjunctive query without self-joins, studying
the specific case where the atoms have simple keys or the key consists of all attributes.
It remains an open problem whether such a dichotomy exists for general conjunctive
queries. It is also interesting to study whether the dichotomy can be strengthened to a
trichotomy: FO-expressible, PTIME and coNP-complete.